AI Agent Memory Explained in 3 Levels of Difficulty
A stateless AI agent has no memory of previous calls.
This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

The noise we make is hurting animals. Can we learn to shut up?

As human society has expanded, animals have started struggling to hear one another. For many birds, the noise has grown so loud that they’ve begun to sing with faster trills. Now, their mating calls aren’t as effective. The growing hubbub can also increase bird-on-bird conflict, and entire species that can’t handle urban clamor simply leave town for good. But there are technological solutions to the noises hurting animals—and they could help humans, too. Read the full story.

—Clive Thompson

Los Angeles is finally going underground

In May, a new subway segment will connect downtown Los Angeles to the Pacific Ocean. What today can be an hours-long drive through a busy, museum-packed stretch of the city will be, if all goes well, a 25-minute train ride. The existence of subway stops in this part of town—known as Miracle Mile—is a technological triumph over geography and geology. Find out why.

—Adam Rogers

Both of these stories are from the next issue of our print magazine, which is all about nature. Subscribe now to read it when it lands tomorrow.

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.

1 Apple’s Tim Cook is stepping down as CEO
Hardware chief John Ternus will take over from him in September. (CNN)
+ Ternus’ defining challenge may be fixing Apple’s AI strategy. (CNBC)
+ How does Cook compare with Apple’s other CEOs through the years? (NYT $)

2 Anthropic’s new Amazon deal escalates the compute war with OpenAI
Anthropic will spend more than $100 billion on Amazon compute. (Axios $)
+ OpenAI touted its compute advantage over Anthropic two weeks ago. (Bloomberg $)
+ Here’s why the AI compute explosion has only just begun. (MIT Technology Review)

3 Silicon Valley is trying to get into the news business
The latest addition is Andreessen Horowitz’s MTS. (The Information $)
+ OpenAI recently bought a business talk show. (NPR)
+ They join Elon Musk’s X and a new Peter Thiel-backed startup. (Axios)

4 The banking industry is scrambling to get access to Anthropic’s Mythos
As regulators review the risks to financial services. (Reuters $)
+ Germany’s central bank has called for wider access to Mythos. (Bloomberg $)

5 War memes are turning conflict into content
Fueled by recommendation systems designed to keep you hooked. (Wired $)
+ AI is turning the Iran conflict into theater. (MIT Technology Review)

6 AI is boosting worker productivity, but not their paychecks
Employees aren’t financially benefiting from their extra efficiency. (Quartz)
+ New data sheds light on the current state of AI. (MIT Technology Review)

7 Amazon’s ambition to rival Starlink has hit a setback
After a Blue Origin rocket was grounded. (FT $)

8 Jeff Bezos’s AI lab has neared a $38 billion valuation
In an imminent $10 billion fundraising deal from investors. (FT $)
+ The startup focuses on AI for engineering and manufacturing. (Reuters $)

9 Scientific AI agents have got their own social network
Where they share, debate, and discuss research papers. (Nature)

10 A Mars rover has discovered new “origin-of-life” molecules
They suggest Mars wasn’t always a lifeless red desert. (Gizmodo)

Quote of the day

“He’s been a transformational Apple CEO that’s always had a steady hand at the wheel. I think that will be his legacy. He had massive shoes to step into, and he was the right person for the job. That’s the way he’ll be remembered.”

One More Thing

MIKE MCQUADE

The race to save our online lives from a digital dark age

There is more stuff being created now than at any time in history, but our data is more fragile than ever. One day in the future, YouTube’s videos may permanently disappear. Facebook—and your uncle’s holiday posts—will vanish.
For many archivists, alarm bells are ringing. Across the world, they’re scraping up defunct websites, saving at-risk data collections, and developing data storage technologies that could last thousands of years. Their work raises complex questions. What is important to us? How do we decide what to keep—and what do we let go? Read our story on the thorny problems of digital preservation.

—Niall Firth

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line.)

+ Apple’s forgotten co-founder recently shared his story of the company’s early days.
+ Witness a rare underwater volcanic eruption in the Solomon Islands.
+ Learn what makes Shakespeare’s writing so effective in this masterful analysis.
+ An Artemis II astronaut shared a stunning iPhone video showing Earth disappear behind the Moon at 8x zoom.
The Download: turning down human noise, and LA’s stunning subway upgrade
Tech workers in China are being instructed by their bosses to train AI agents to replace them—and it’s prompting a wave of soul-searching among otherwise enthusiastic early adopters. Earlier this month a GitHub project called Colleague Skill, which claimed workers could use it to “distill” their colleagues’ skills and personality traits and replicate them with an AI agent, went viral on Chinese social media. Though the project was created as a spoof, it struck a nerve among tech workers, a number of whom told MIT Technology Review that their bosses are encouraging them to document their workflows in order to automate specific tasks and processes using AI agent tools like OpenClaw or Claude Code. To set up Colleague Skill, a user names the coworker whose tasks they want to replicate and adds basic profile details. The tool then automatically imports chat history and files from Lark and DingTalk, both popular workplace apps in China, and generates reusable manuals describing that coworker’s duties—and even their unique quirks—for an AI agent to replicate. Colleague Skill was created by Tianyi Zhou, who works as an engineer at the Shanghai Artificial Intelligence Laboratory. Earlier this week he told Chinese outlet Southern Metropolis Daily that the project was started as a stunt, prompted by AI-related layoffs and by the growing tendency of companies to ask employees to automate themselves. He didn’t respond to requests for further comment. Internet users have found humor in the idea behind the tool, joking about automating their coworkers before themselves. However, Colleague Skill’s virality has sparked a lot of debate about workers’ dignity and individuality in the age of AI. After seeing Colleague Skill on social media, Amber Li, 27, a tech worker in Shanghai, used it to recreate a former coworker as a personal experiment. Within minutes, the tool created a file detailing how that person did their job. “It is surprisingly good,” Li says. 
“It even captures the person’s little quirks, like how they react and their punctuation habits.” With this skill, Li can use an AI agent as a new “coworker” that helps debug her code and replies instantly. It felt uncanny and uncomfortable, Li says. Even so, replacing coworkers with agents could become a norm. Since OpenClaw became a national craze, bosses in China have been pushing tech workers to experiment with agents. Although AI agents can take control of your computer, read and summarize news, reply to emails, and book restaurant reservations for you, tech workers on the ground say their utility has so far proven to be limited in business contexts. Asking employees to make manuals describing the minutiae of their day-to-day jobs the way Colleague Skill does is one way to help bridge that gap. Hancheng Cao, an assistant professor at Emory University who studies AI and work, believes that companies have good reasons to push employees to create work blueprints like these, beyond simply following a trend. “Firms gain not only internal experience with the tools, but also richer data on employee know-how, workflows, and decision patterns. That helps companies see which parts of work can be standardized or codified into systems, and which still depend on human judgment,” he says. To employees, though, making agents or even blueprints for them can feel strange and alienating. One software engineer, who spoke with MIT Technology Review anonymously because of concerns about their job security, trained an AI (not Colleague Skill) on their workflow and found that the process felt reductive—as if their work had been flattened into modules in a way that made them easier to replace. On social media, workers have turned to bleak humor to express similar feelings. 
In one comment on Rednote, a user wrote that “a cold farewell can be turned into warm tokens,” quipping that if they use Colleague Skill to distill their coworkers into tasks first, they themselves might survive a little longer. The push for creating agents has also spurred clever countermeasures. Irritated by the idea of reducing a person to a skill, Koki Xu, 26, an AI product manager in Beijing, published an “anti-distillation” skill on GitHub on April 4. The tool, which took Xu about an hour to build, is designed to sabotage the process of creating workflows for agents. Users can choose between light, medium, and heavy sabotage modes depending on how closely their boss is observing the process, and the agent rewrites the material into generic, non-actionable language that would produce a less useful AI stand-in. A video Xu posted about the project went viral, drawing more than 5 million likes across platforms. Xu told MIT Technology Review that she has been following the Colleague Skill trend from the start and that it has made her think about alienation, disempowerment, and broader implications for labor. “I originally wanted to write an op-ed, but decided it would be more useful to make something that pushes back against it,” she says. Xu, who has undergraduate and master’s degrees in law, said the trend also raises legal questions. While a company may be able to argue that work chat histories and materials created on a work laptop are corporate property, a skill like this can also capture elements of personality, tone, and judgment, making ownership much less clear. She said she hopes Colleague Skill prompts more discussion about how to protect workers’ dignity and identity in the age of AI. “I believe it’s important to keep up with these trends so we (employees) can participate in shaping how they are used,” she says. Xu herself is an avid AI adopter, with seven OpenClaw agents set up across her personal and work devices.
Li, the tech worker in Shanghai, says her company has not yet found a way to replace actual workers with AI tools, largely because they remain unreliable and require constant supervision. “I don’t feel like my job is immediately at risk,” she says. “But I do feel that my value is being cheapened, and I don’t know what to do about it.”
Chinese tech workers are starting to train their AI doubles—and pushing back
Cybersecurity has always had a dual-use problem: the same technical knowledge that helps defenders find vulnerabilities can also help attackers exploit them. For AI systems, that tension is sharper than ever. Restrictions intended to prevent harm have historically created friction for good-faith security work, and it can be genuinely difficult to tell whether any particular cyber action is intended for defensive usage or to cause harm. OpenAI is now proposing a concrete structural solution to that problem: verified identity, tiered access, and a purpose-built model for defenders. OpenAI announced that it is scaling up its Trusted Access for Cyber (TAC) program to thousands of verified individual defenders and hundreds of teams responsible for defending critical software. The main focus of this expansion is the introduction of GPT-5.4-Cyber, a variant of GPT-5.4 fine-tuned specifically for defensive cybersecurity use cases.

What Is GPT-5.4-Cyber and How Does It Differ From Standard Models?

If you’re an AI engineer or data scientist who has worked with large language models on security tasks, you’re likely familiar with the frustrating experience of a model refusing to analyze a piece of malware or explain how a buffer overflow works — even in a clearly research-oriented context. GPT-5.4-Cyber is designed to eliminate that friction for verified users. Unlike standard GPT-5.4, which applies blanket refusals to many dual-use security queries, GPT-5.4-Cyber is described by OpenAI as ‘cyber-permissive’ — meaning it has a deliberately lower refusal threshold for prompts that serve a legitimate defensive purpose. That includes binary reverse engineering, enabling security professionals to analyze compiled software for malware potential, vulnerabilities, and security robustness without access to the source code. Binary reverse engineering without source code is a significant capability unlock.
In practice, defenders routinely need to analyze closed-source binaries — firmware on embedded devices, third-party libraries, or suspected malware samples — without having access to the original code. That model was described as a GPT-5.4 variant purposely fine-tuned for additional cyber capabilities, with fewer capability restrictions and support for advanced defensive workflows including binary reverse engineering without source code. There are also hard limits. Users with trusted access must still abide by OpenAI’s Usage Policies and Terms of Use. The approach is designed to reduce friction for defenders while preventing prohibited behavior, including data exfiltration, malware creation or deployment, and destructive or unauthorized testing. This distinction matters: TAC lowers the refusal boundary for legitimate work, but does not suspend policy for any user. There are also deployment constraints. Use in zero-data-retention environments is limited, given that OpenAI has less visibility into the user, environment, and intent in those configurations — a tradeoff the company frames as a necessary control surface in a tiered-access model. For dev teams accustomed to running API calls in Zero-Data-Retention mode, this is an important implementation constraint to plan around before building pipelines on top of GPT-5.4-Cyber.

The Tiered Access Framework: How TAC Actually Works

TAC is not a checkbox feature — it is an identity-and-trust-based access framework with multiple tiers. Understanding the structure matters if you or your organization plans to integrate these capabilities. The access process runs through two paths. Individual users can verify their identity at chatgpt.com/cyber. Enterprises can request trusted access for their team through an OpenAI representative. Customers approved through either path gain access to model versions with reduced friction around safeguards that might otherwise trigger on dual-use cyber activity.
Approved uses include security education, defensive programming, and responsible vulnerability research. TAC customers who want to go further and authenticate as cyber defenders can express interest in additional access tiers, including GPT-5.4-Cyber. Deployment of the more permissive model is starting with a limited, iterative rollout to vetted security vendors, organizations, and researchers. That means OpenAI is now drawing at least three practical lines instead of one: there is baseline access to general models; there is trusted access to existing models with less accidental friction for legitimate security work; and there is a higher tier of more permissive, more specialized access for vetted defenders who can justify it. The framework is grounded in three explicit principles. The first is democratized access: using objective criteria and methods, including strong KYC and identity verification, to determine who can access more advanced capabilities, with the goal of making those capabilities available to legitimate actors of all sizes, including those protecting critical infrastructure and public services. The second is iterative deployment — OpenAI updates models and safety systems as it learns more about the benefits and risks of specific versions, including improving resilience to jailbreaks and adversarial attacks. The third is ecosystem resilience, which includes targeted grants, contributions to open-source security initiatives, and tools like Codex Security.

How the Safety Stack Is Built: From GPT-5.2 to GPT-5.4-Cyber

It’s worth understanding how OpenAI has structured its safety architecture across model versions — because TAC is built on top of that architecture, not instead of it. OpenAI began cyber-specific safety training with GPT-5.2, then expanded it with additional safeguards through GPT-5.3-Codex and GPT-5.4.
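The three practical access lines described in this section can be pictured as a simple selection policy. The following is a hypothetical sketch of the idea, not OpenAI's implementation; the tier names and the model mapping are our assumptions based on this article.

```python
# Hypothetical sketch of tiered model selection under a TAC-like framework.
# Tier names and the model mapping below are illustrative assumptions.

BASELINE, TRUSTED, CYBER_DEFENDER = "baseline", "trusted", "cyber_defender"

TIER_MODELS = {
    BASELINE: "gpt-5.4",              # general model, standard refusal behavior
    TRUSTED: "gpt-5.4",               # same model, less accidental friction for security work
    CYBER_DEFENDER: "gpt-5.4-cyber",  # permissive variant for vetted defenders
}

def select_model(tier, identity_verified):
    """Pick a model for a request given the caller's access tier.

    Elevated tiers require identity verification (the framework's KYC
    principle); unverified callers fall back to baseline access.
    """
    if tier != BASELINE and not identity_verified:
        return TIER_MODELS[BASELINE]
    return TIER_MODELS.get(tier, TIER_MODELS[BASELINE])
```

The point of the sketch is only the shape of the policy: capability grows with verified identity, and anything unverified or unrecognized degrades safely to the baseline.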
A critical milestone in that progression: GPT-5.3-Codex is the first model OpenAI is treating as High cybersecurity capability under its Preparedness Framework, which requires additional safeguards. These safeguards include training the model to refuse clearly malicious requests like stealing credentials. The Preparedness Framework is OpenAI’s internal evaluation rubric for classifying how dangerous a given capability level could be. Reaching ‘High’ under that framework is what triggered the full cybersecurity safety stack being deployed — not just model-level training, but an additional automated monitoring layer. In addition to safety training, automated classifier-based monitors detect signals of suspicious cyber activity and route high-risk traffic to a less cyber-capable model, GPT-5.2. In other words, if a request looks suspicious enough to exceed a threshold, the platform doesn’t just refuse — it silently reroutes the traffic to a safer fallback model. This is a key architectural detail: safety is enforced not only inside model weights, but also at the infrastructure routing layer. GPT-5.4-Cyber extends this stack further upward — more permissive for verified defenders, but wrapped in
If you want to capture something wolflike, it’s best to embark before dawn. So on a morning this January, with the eastern horizon still pink-hued, I drove with two young scientists into a blanket of fog. Forty miles to the west, the industrial sprawl of Houston spawned a golden glow. Tanner Broussard’s old Toyota Tacoma bumped over the levee-top roads as killdeer, flushed from their rest, flew across the beams of his headlights. Broussard peered into the darkness, looking for traps. “I have one over here,” he said, slowing slightly. A master’s student at McNeese State University, he was quiet and contemplative, his bearded face half-hidden under a black ball cap. “Nothing on it,” he said, blandly. The truck rolled on. Wolves and their relations—dogs, jackals, coyotes, and so on—are classed in the family Canidae, and the canid that dominated this landscape in eastern Texas was once the red wolf. But as soon as white settlers arrived on the continent, Canis rufus found itself under siege. The war on wolves “lasted 200 years,” federal researchers once put it, in a surprisingly evocative report. “The wolf lost.” By 1980, the red wolf was declared extinct in the wild, its population reduced to a small captive breeding population. Still, for decades afterward, people noted that strange wolflike creatures persisted along the Gulf Coast. Finally, in 2018, scientists confirmed that some local coyotes were more than coyotes: They were taller, long-legged, their coats shaded with hints of cinnamon. These animals contained relict red wolf genes. They became known as the ghost wolves. Broussard grew up in southwest Louisiana, watching coyotes trot across his parents’ ranch. The thrilling fact that these might have been not just coyotes but something more? That reset a rambling academic career. In 2023, Broussard had recently returned to college after a seven-year pause, and his budding obsession with wolves narrowed his focus. 
Before he finished his bachelor’s degree, he began to supply field data to a prominent conservation nonprofit.

The American red wolf, Canis rufus, is the most endangered wolf species in the world. This pup is one of four animals said to be clones of this native North American species. COURTESY OF COLOSSAL BIOSCIENCES

Then, last year, just before he began his master’s studies, he woke to disconcerting news. A startup called Colossal Biosciences claimed to have resuscitated the dire wolf, a large canid that went extinct more than 10,000 years ago. Pundits debated the utility of the project and whether the clones—technically, gray wolves with some genetic tweaks—could really be called dire wolves. But what mattered to Broussard was Colossal’s simultaneous announcement that it had cloned four red wolves. “That surprised pretty much everybody in the wolf community,” Broussard said as we toured the wildlife refuge where he’d set his traps. The Association of Zoos and Aquariums runs a program that sustains red wolves through captive breeding; its leadership had no idea a cloning project was underway. Nor did ecologist Joey Hinton, one of Broussard’s advisors, who had trapped the canids Colossal used to source the DNA for its clones. Some of Hinton’s former partners were collaborating with the company, but he didn’t know that clones were on the table. There was already disagreement among scientists about the entire idea of de-extinction. Now Colossal had made these mystery clones, whose location was kept secret. Even the purpose of the clones was murky to some scientists; just how they might restore red wolf populations was unclear. Red wolves had always been a contentious species, hard for scientists to pin down. The red wolf research community was already marked by the inevitable interpersonal tensions of a small and passionate group. Now Colossal’s clones became one more lightning rod.
Perhaps the most curious question, though, was whether the company had cloned red wolves at all. You can think of the red wolf as the wolf of the East—an apex predator that once roamed the forests and grasslands and marshes everywhere from Texas to Illinois to New York. Smaller than a gray wolf (though a good bit larger than a coyote), this was a sleek beast, with, according to one old field guide, a “cunning fox-like appearance”: long body, long legs; clearly built to run across long distances. Its coat was smooth and flat and came in many colors: a reddish tone that comes out in the right light, yes, but also, despite the name, white and gray and, in certain regions and populations, an ominous all black. We know these details thanks to a few notes from early naturalists. As writer Andrew Moore writes in his new book, The Beasts of the East, by the time a mammalogist decided to class these eastern wolves as a standalone species in the 1930s, the red wolf had been extirpated from the East Coast and was rapidly dwindling across its range. Working with remnant skulls and other specimens, the mammalogist chose the name red wolf—which was later enshrined with the Latinate Canis rufus—because that’s what these wolves were called in the last place they survived. The looming extinction of the red wolf turned out to be a good thing for coyotes. Canis latrans is a distant relative of wolves that split away from a common ancestor thousands of years ago and might be considered, as one canid biologist put it to me, the “wolf of the Anthropocene.” Their smaller size means they need less food and can survive in smaller and more fragmented territory, the kind that modern humans tend to build. Red wolves had kept coyotes out of eastern America, outcompeting them for prey. Now, as the wolves declined, the coyotes began to slip in.
The last red wolves, which lived in Louisiana and Texas, decided a strange and smaller mate was preferable to no mate at all. Soon the territory became a genetic jumble, home to both
Colossal Biosciences said it cloned red wolves. Is it for real?
Zero-shot text classification is a way to label text without first training a classifier on your own task-specific dataset.
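As a quick illustration of the interface, here is a minimal sketch built around the Hugging Face transformers zero-shot pipeline; the helper functions and example labels are ours, and the model named in the usage comment is just one common public choice.

```python
# Minimal zero-shot classification sketch. The helper functions are
# illustrative; the NLI model named in the usage comment below is one
# common public choice, not the only option.

def top_label(result):
    """Return the best label from a zero-shot pipeline result.

    The transformers zero-shot pipeline returns a dict like
    {"sequence": ..., "labels": [...], "scores": [...]} with labels
    sorted by descending score, so the first label is the winner.
    """
    return result["labels"][0]

def classify(text, candidate_labels, classifier):
    """Label `text` with one of `candidate_labels` using a zero-shot classifier."""
    return top_label(classifier(text, candidate_labels))

# Real usage (requires `pip install transformers` plus a model download):
# from transformers import pipeline
# clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
# classify("The new GPU doubles training throughput.",
#          ["technology", "sports", "cooking"], clf)
```

Note that no task-specific training data is involved: the candidate labels are supplied at inference time, which is the defining property of the zero-shot setup.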
Getting Started with Zero-Shot Text Classification
This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

No one’s sure if synthetic mirror life will kill us all

In February 2019, a group of scientists proposed a high-risk, cutting-edge, irresistibly exciting idea that the National Science Foundation should fund: making “mirror” bacteria. These lab-created microbes would be organized like ordinary bacteria, but their proteins and sugars would be mirror images of those found in nature. Researchers believed they could reveal new insights into building cells, designing drugs, and even the origins of life. But now, many of them have reversed course. They’ve become convinced that mirror organisms could trigger a catastrophic event threatening every form of life on Earth. Find out why they’re ringing alarm bells.

—Stephen Ornes

This story is from the next issue of our print magazine, which is all about nature. Subscribe now to read it when it lands this Wednesday.

Chinese tech workers are starting to train their AI doubles—and pushing back

Earlier this month, a GitHub project called Colleague Skill struck a nerve by claiming to “distill” a worker’s skills and personality—and replicate them with an AI agent. Though the project was a spoof, it prompted a wave of soul-searching among otherwise enthusiastic early adopters. A number of tech workers told MIT Technology Review that their bosses are already encouraging them to document their workflows for automation via tools like OpenClaw. Many now fear that they are being flattened into code and losing their professional identity. In response, some are fighting back with tools designed to sabotage the automation process. Read the full story.

—Caiwei Chen

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.
1 The White House and Anthropic are working toward a compromise
The Trump administration says they had a “productive meeting.” (Reuters $)
+ Trump had ordered US agencies to phase out Anthropic’s tech. (Guardian)
+ Despite the blacklist, the NSA is using Anthropic’s new Mythos model. (Axios)

2 Palantir has unveiled a manifesto calling for universal national service
While denouncing inclusivity and “regressive” cultures. (TechCrunch)
+ It’s a summary of CEO Alex Karp’s book “The Technological Republic.” (Engadget)
+ One critic called the book “a piece of corporate sales material.” (Bloomberg $)

3 Germany’s chancellor and largest company want looser AI rules
Chancellor Merz said industrial AI needs more regulatory freedom. (Reuters $)
+ Siemens says it plans to shift investments to the US if EU rules don’t change. (Bloomberg $)
+ Fractures over AI regulation are also emerging in the US. (MIT Technology Review)

4 Nvidia’s once-tight bond with gamers is cracking over AI
Consumer graphics cards are no longer the priority. (CNBC)
+ But generative AI could reinvent what it means to play. (MIT Technology Review)

5 Insurers are trying to exclude AI-related harms from their coverage
And escape legal liability for AI’s mistakes. (FT $)
+ AI images are being used in insurance scams. (BBC)

6 AI is about to make the global e-waste crisis much worse
And most of the trash will end up in non-Western countries. (Rest of World)
+ Here’s what we can do about it. (MIT Technology Review)

7 Tinder and Zoom have partnered with Sam Altman’s eye-scanning firm
To offer a “proof of humanity” badge to users. (BBC)

8 Islamist insurgents in West Africa are driving surging demand for drones
A Nigerian UAV startup is opening its first factory abroad in Ghana. (Bloomberg $)

9 Hundreds of fake pro-Trump AI influencers are flooding social media
In an apparent bid to hook conservative voters. (NYT)

10 A Chinese humanoid has smashed the human half-marathon record
Despite crashing into a railing near the end of the race. (NBC News)
+ Chinese tech firm Honor swept the podium spots. (Engadget)
+ Last year, humans won the race by a mile. (CNN)

Quote of the day

“This is the only issue where you’ve got Steve Bannon and Ralph Nader, Glenn Beck and Bernie Sanders fighting for the same thing.”

—Ben Cumming, head of communications at the AI safety nonprofit Future of Life Institute, tells the Washington Post that diverse public figures are endorsing a declaration of AI policy priorities.

One More Thing

NASA

The great commercial takeover of low Earth orbit

The International Space Station will be decommissioned as soon as 2030, but the story of America in low Earth orbit (LEO) will continue. Using lessons from the ISS, NASA has partnered with private companies to develop new commercial space stations for research, manufacturing, and tourism. If they are successful, these businesses will bring about a new era of space exploration: private rockets flying to private destinations. They will also demonstrate a new model in which NASA builds infrastructure and the private sector takes it from there—freeing the agency to explore deeper and deeper into space. Read the full story.

—David W. Brown

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line.)

+ Bask in this adorable test of a dog’s devotion.
+ This vocal pitch trainer improves your singing straight from your browser.
+ Master international etiquette with this interactive guide to the world’s cultures.
+ Explore the networks of public figures with this intriguing interactive graph.
The Download: murderous ‘mirror’ bacteria, and Chinese workers fighting AI doubles
Anthropic has launched Claude Opus 4.7, its latest frontier model and a direct successor to Claude Opus 4.6. The release is positioned as a focused improvement rather than a full generational leap, but the gains it delivers are substantial in the areas that matter most to developers building real-world AI-powered applications: agentic software engineering, multimodal reasoning, and long-running autonomous task execution.

https://www.anthropic.com/news/claude-opus-4-7

What Exactly is Claude Opus 4.7?

Anthropic maintains a model family with tiers — Haiku (fast and lightweight), Sonnet (balanced), and Opus (highest capability). Opus 4.7 sits at the top of this stack, below only the newly previewed Claude Mythos, which Anthropic has kept in a restricted release. Opus 4.7 represents a notable improvement on Opus 4.6 in advanced software engineering, with particular gains on the most difficult tasks. Crucially, users report being able to hand off their hardest coding work — the kind that previously needed close supervision — to Opus 4.7 with confidence, as it handles complex, long-running tasks with rigor and consistency, pays precise attention to instructions, and devises ways to verify its own outputs before reporting back. The model verifying its own outputs is a meaningful behavioral shift. Earlier models often produced results without internal sanity checks; Opus 4.7 appears to close that loop autonomously, which has significant implications for CI/CD pipelines and multi-step agentic workflows.

Stronger Coding Benchmarks

Early testers have put some sharp numbers on the coding improvements. On a 93-task coding benchmark, Opus 4.7 lifted resolution by 13% over Opus 4.6, including four tasks that neither Opus 4.6 nor Sonnet 4.6 could solve. On CursorBench — a widely used developer evaluation harness — Opus 4.7 cleared 70% versus Opus 4.6 at 58%.
And for complex multi-step workflows, one tester observed a 14% gain over Opus 4.6 with fewer tokens and a third of the tool errors. Notably, Opus 4.7 was the first model to pass their implicit-need tests, continuing to execute through tool failures that used to stop Opus cold.

Improved Vision: 3× the Resolution of Prior Models

One of the most technically concrete upgrades in Opus 4.7 is its multimodal capability. Opus 4.7 can now accept images up to 2,576 pixels on the long edge (~3.75 megapixels), more than three times as many pixels as prior Claude models.

Many real-world applications fail not because the model lacks reasoning ability, but because it can’t resolve fine visual detail. The higher ceiling opens up exactly those multimodal uses: computer-use agents reading dense UI screenshots, data extraction from complex engineering diagrams, and work that needs pixel-perfect references.

The impact in production has already been dramatic. One tester working on computer-use workflows reported that Opus 4.7 scored 98.5% on their visual-acuity benchmark versus 54.5% for Opus 4.6, effectively eliminating their single biggest Opus pain point.

This is a model-level change rather than an API parameter, so images users send to Claude will simply be processed at higher fidelity. Because higher-resolution images consume more tokens, however, users who don’t require the extra detail can downsample images before sending them to the model.

https://www.anthropic.com/news/claude-opus-4-7

A New Effort Level: xhigh, Plus Task Budgets

Developers working with the Claude API will notice two new levers for controlling compute spend. First, Opus 4.7 introduces a new xhigh (“extra high”) effort level between high and max, giving users finer control over the tradeoff between reasoning depth and latency on hard problems.
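Since downsampling is left to the user, here is a minimal sketch of the resize arithmetic, in plain Python. The 2,576-pixel long-edge cap is taken from the article; the helper name is ours. The resulting (width, height) can then be applied with any image library (for example Pillow's Image.resize) before the image is uploaded.

```python
# Sketch: compute the largest size that fits under the model's long-edge cap,
# so callers can downsample before sending and avoid paying for unused pixels.
# The 2,576-px figure comes from the article; fit_long_edge is our own helper.
MAX_LONG_EDGE = 2576

def fit_long_edge(width: int, height: int, max_edge: int = MAX_LONG_EDGE):
    """Return (width, height) scaled uniformly so max(w, h) <= max_edge."""
    long_edge = max(width, height)
    if long_edge <= max_edge:
        return (width, height)          # already within the cap: no resize
    scale = max_edge / long_edge
    return (round(width * scale), round(height * scale))

# A 4000x3000 screenshot would be scaled down to 2576x1932 before upload.
```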
In Claude Code, Anthropic has raised the default effort level to xhigh for all plans. When testing Opus 4.7 for coding and agentic use cases, Anthropic recommends starting with high or xhigh effort.

Second, task budgets are launching in public beta on the Claude Platform API, giving developers a way to guide Claude’s token spend so it can prioritize work across longer runs. Together, these two controls give developer teams meaningful production levers, especially when running parallelized agent pipelines where per-call cost and latency must be managed carefully.

New in Claude Code: /ultrareview and Auto Mode for Max Users

Two new Claude Code features ship alongside Opus 4.7 that are worth flagging for developers who use it as part of their workflow.

The new /ultrareview slash command produces a dedicated review session that reads through changes and flags bugs and design issues that a careful reviewer would catch. Anthropic is giving Pro and Max Claude Code users three free ultrareviews to try it out. Think of it as a senior-engineer review pass on demand, useful before merging complex PRs or shipping to production.

Additionally, auto mode has been extended to Max users. Auto mode is a permissions option in which Claude makes decisions on your behalf, so you can run longer tasks with fewer interruptions, and with less risk than skipping all permission checks. This is particularly valuable for agents executing multi-step tasks overnight or across large codebases.

File System-Based Memory for Long Multi-Session Work

A less-discussed but operationally significant improvement is how Opus 4.7 handles memory. The model is better at using file system-based memory: it keeps important notes across long, multi-session work and uses them to move on to new tasks that, as a result, need less up-front context.
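As a rough client-side analogue of task budgets, assuming you can read token usage from each API response, a small wrapper like the following caps spend across a multi-step run. The class, method names, and numbers are ours for illustration; they are not part of the Claude Platform API, where task budgets are enforced server-side.

```python
# Client-side analogue of a task budget: cap total token spend across a
# multi-step agent run. Illustrative only; the real feature is server-side.
class TokenBudget:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.spent = 0

    def charge(self, tokens: int) -> bool:
        """Record spend; return False once the budget would be exceeded."""
        if self.spent + tokens > self.max_tokens:
            return False
        self.spent += tokens
        return True

budget = TokenBudget(10_000)
for step_cost in [4_000, 3_500, 3_000]:   # per-step token usage (made up)
    if not budget.charge(step_cost):
        break  # agent should wrap up remaining work instead of starting more
```

The useful behavior is the early break: rather than running out of tokens mid-task, the agent learns it is near the cap and can prioritize finishing.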
The model also achieved state-of-the-art results on GDPval-AA, a third-party evaluation of economically valuable knowledge work across finance, legal, and other domains.

Key Takeaways

Claude Opus 4.7 is Anthropic’s strongest coding model to date, handling complex, long-running agentic tasks with far less supervision than Opus 4.6, and it uniquely verifies its own outputs before reporting back.

Vision capability has tripled, with support for images up to ~3.75 megapixels, making it significantly more reliable for computer-use agents, diagram parsing, and any workflow that depends on fine visual detail.

A new xhigh effort level and task budgets give developers precise control over the reasoning-vs-latency tradeoff and token spend — critical levers for teams running parallelized agent pipelines.
In this tutorial, we show how to run the Bonsai 1-bit large language model efficiently using GPU acceleration and PrismML’s optimized GGUF deployment stack. We set up the environment, install the required dependencies, download the prebuilt llama.cpp binaries, and load the Bonsai-1.7B model for fast inference on CUDA. As we progress, we examine how 1-bit quantization works under the hood, why the Q1_0_g128 format is so memory-efficient, and how this makes Bonsai practical for lightweight yet capable language model deployment. We also test core inference, benchmarking, multi-turn chat, structured JSON generation, code generation, OpenAI-compatible server mode, and a small retrieval-augmented generation workflow, giving us a complete, hands-on view of how Bonsai operates in real-world use.

import os, sys, subprocess, time, json, urllib.request, tarfile, textwrap

try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

def section(title):
    bar = "═" * 60
    print(f"\n{bar}\n {title}\n{bar}")

section("1 · Environment & GPU Check")

def run(cmd, capture=False, check=True, **kw):
    return subprocess.run(
        cmd, shell=True, capture_output=capture, text=True, check=check, **kw
    )

gpu_info = run("nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader",
               capture=True, check=False)
if gpu_info.returncode == 0:
    print(" GPU detected:", gpu_info.stdout.strip())
else:
    print(" No GPU found — inference will run on CPU (much slower).")

cuda_check = run("nvcc --version", capture=True, check=False)
if cuda_check.returncode == 0:
    for line in cuda_check.stdout.splitlines():
        if "release" in line:
            print(" CUDA:", line.strip())
            break

print(f" Python {sys.version.split()[0]} | Platform: Linux (Colab)")

section("2 · Installing Python Dependencies")
run("pip install -q huggingface_hub requests tqdm openai")
print(" huggingface_hub, requests, tqdm, openai installed")

from huggingface_hub import hf_hub_download

We begin
by importing the core Python modules that we need for system operations, downloads, timing, and JSON handling. We check whether we are running inside Google Colab, define a reusable section printer, and create a helper function to run shell commands cleanly from Python. We then verify the GPU and CUDA environment, print the Python runtime details, install the required Python dependencies, and prepare the Hugging Face download utility for the next stages.

section("3 · Downloading PrismML llama.cpp Prebuilt Binaries")

RELEASE_TAG = "prism-b8194-1179bfc"
BASE_URL = f"https://github.com/PrismML-Eng/llama.cpp/releases/download/{RELEASE_TAG}"
BIN_DIR = "/content/bonsai_bin"
os.makedirs(BIN_DIR, exist_ok=True)

def detect_cuda_build():
    r = run("nvcc --version", capture=True, check=False)
    for line in r.stdout.splitlines():
        if "release" in line:
            try:
                ver = float(line.split("release")[-1].strip().split(",")[0].strip())
                if ver >= 13.0:
                    return "13.1"
                if ver >= 12.6:
                    return "12.8"
                return "12.4"
            except ValueError:
                pass
    return "12.4"

cuda_build = detect_cuda_build()
print(f" Detected CUDA build slot: {cuda_build}")

TAR_NAME = f"llama-{RELEASE_TAG}-bin-linux-cuda-{cuda_build}-x64.tar.gz"
TAR_URL = f"{BASE_URL}/{TAR_NAME}"
tar_path = f"/tmp/{TAR_NAME}"

if not os.path.exists(f"{BIN_DIR}/llama-cli"):
    print(f" Downloading: {TAR_URL}")
    urllib.request.urlretrieve(TAR_URL, tar_path)
    print(" Extracting ...")
    with tarfile.open(tar_path, "r:gz") as t:
        t.extractall(BIN_DIR)
    for fname in os.listdir(BIN_DIR):
        fp = os.path.join(BIN_DIR, fname)
        if os.path.isfile(fp):
            os.chmod(fp, 0o755)
    print(f" Binaries extracted to {BIN_DIR}")
    bins = sorted(f for f in os.listdir(BIN_DIR) if os.path.isfile(os.path.join(BIN_DIR, f)))
    print(" Available:", ", ".join(bins))
else:
    print(f" Binaries already present at {BIN_DIR}")

LLAMA_CLI = f"{BIN_DIR}/llama-cli"
LLAMA_SERVER = f"{BIN_DIR}/llama-server"

test = run(f"{LLAMA_CLI} --version", capture=True, check=False)
if test.returncode == 0:
    print(f" llama-cli version: {test.stdout.strip()[:80]}")
else:
    print(f" llama-cli test failed: {test.stderr.strip()[:200]}")

section("4 · Downloading Bonsai-1.7B GGUF Model")

MODEL_REPO = "prism-ml/Bonsai-1.7B-gguf"
MODEL_DIR = "/content/bonsai_models"
GGUF_FILENAME = "Bonsai-1.7B.gguf"
os.makedirs(MODEL_DIR, exist_ok=True)
MODEL_PATH = os.path.join(MODEL_DIR, GGUF_FILENAME)

if not os.path.exists(MODEL_PATH):
    print(f" Downloading {GGUF_FILENAME} (~248 MB) from HuggingFace ...")
    MODEL_PATH = hf_hub_download(
        repo_id=MODEL_REPO,
        filename=GGUF_FILENAME,
        local_dir=MODEL_DIR,
    )
    print(f" Model saved to: {MODEL_PATH}")
else:
    print(f" Model already cached: {MODEL_PATH}")

size_mb = os.path.getsize(MODEL_PATH) / 1e6
print(f" File size on disk: {size_mb:.1f} MB")

section("5 · Core Inference Helpers")

DEFAULT_GEN_ARGS = dict(
    temp=0.5,
    top_p=0.85,
    top_k=20,
    repeat_penalty=1.0,
    n_predict=256,
    n_gpu_layers=99,
    ctx_size=4096,
)

def build_llama_cmd(prompt, system_prompt="You are a helpful assistant.", **overrides):
    args = {**DEFAULT_GEN_ARGS, **overrides}
    formatted = (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )
    safe_prompt = formatted.replace('"', '\\"')
    return (
        f'{LLAMA_CLI} -m "{MODEL_PATH}"'
        f' -p "{safe_prompt}"'
        f' -n {args["n_predict"]}'
        f' --temp {args["temp"]}'
        f' --top-p {args["top_p"]}'
        f' --top-k {args["top_k"]}'
        f' --repeat-penalty {args["repeat_penalty"]}'
        f' -ngl {args["n_gpu_layers"]}'
        f' -c {args["ctx_size"]}'
        f' --no-display-prompt'
        f' -e'
    )

def infer(prompt, system_prompt="You are a helpful assistant.", verbose=True, **overrides):
    cmd = build_llama_cmd(prompt, system_prompt, **overrides)
    t0 = time.time()
    result = run(cmd, capture=True, check=False)
    elapsed = time.time() - t0
    output = result.stdout.strip()
    if verbose:
        print(f"\n{'─'*50}")
        print(f"Prompt : {prompt[:100]}{'...' if len(prompt) > 100 else ''}")
        print(f"{'─'*50}")
        print(output)
        print(f"{'─'*50}")
        print(f" {elapsed:.2f}s | ~{len(output.split())} words")
    return output, elapsed

print(" Inference helpers ready.")

section("6 · Basic Inference — Hello, Bonsai!")
infer("What makes 1-bit language models special compared to standard models?")

We download and prepare the PrismML prebuilt llama.cpp CUDA binaries that power local inference for the Bonsai model. We detect the available CUDA version, choose the matching binary build, extract the downloaded archive, make the files executable, and verify that the llama-cli binary works correctly. After that, we download the Bonsai-1.7B GGUF model from Hugging Face, set up the model path, define the default generation settings, and build the core helper functions that format prompts and run inference.

section("7 · Q1_0_g128 Quantization — What's Happening Under the Hood")

print(textwrap.dedent("""
╔══════════════════════════════════════════════════════════════╗
║           Bonsai Q1_0_g128 Weight Representation             ║
╠══════════════════════════════════════════════════════════════╣
║  Each weight = 1 bit:  0 → −scale                            ║
║                        1 → +scale                            ║
║  Every 128 weights share one FP16 scale factor.              ║
║                                                              ║
║  Effective bits per weight:                                  ║
║    1 bit (sign) + 16/128 bits (shared scale) = 1.125 bpw     ║
║                                                              ║
║  Memory comparison for Bonsai-1.7B:                          ║
║    FP16:           3.44 GB  (1.0× baseline)                  ║
║    Q1_0_g128:      0.24 GB  (14.2× smaller!)                 ║
║    MLX 1-bit g128: 0.27 GB  (12.8× smaller)                  ║
╚══════════════════════════════════════════════════════════════╝
"""))

print(" Python demo of Q1_0_g128 quantization logic:\n")

import random
random.seed(42)
GROUP_SIZE = 128
weights_fp16 = [random.gauss(0, 0.1) for _ in range(GROUP_SIZE)]
scale = max(abs(w) for w in weights_fp16)
quantized = [1 if w >= 0 else 0 for w in weights_fp16]
dequantized = [scale if b == 1 else -scale for b in quantized]
mse = sum((a - b) ** 2 for a, b in zip(weights_fp16, dequantized)) / GROUP_SIZE

print(f" FP16 weights (first 8): {[f'{w:.4f}' for w in weights_fp16[:8]]}")
print(f" 1-bit repr (first 8): {quantized[:8]}")
print(f" Shared scale: {scale:.4f}")
print(f" Dequantized (first 8): {[f'{w:.4f}' for w in dequantized[:8]]}")
print(f" MSE of reconstruction: {mse:.6f}")

memory_fp16 = GROUP_SIZE * 2
memory_1bit = GROUP_SIZE / 8 + 2
print(f"\n Memory: FP16={memory_fp16}B vs Q1_0_g128={memory_1bit:.1f}B "
      f"({memory_fp16/memory_1bit:.1f}× reduction)")

section("8 ·
In this tutorial, we explore property-based testing using Hypothesis and build a rigorous testing pipeline that goes far beyond traditional unit testing. We implement invariants, differential testing, metamorphic testing, targeted exploration, and stateful testing to validate both functional correctness and behavioral guarantees of our systems. Instead of manually crafting edge cases, we let Hypothesis generate structured inputs, shrink failures to minimal counterexamples, and systematically uncover hidden bugs. We also demonstrate how modern testing practices can be integrated directly into experimental and research-driven workflows.

import sys, textwrap, subprocess, os, re, math
!{sys.executable} -m pip -q install hypothesis pytest

test_code = r'''
import re, math
import pytest
from hypothesis import (
    given, assume, example, settings, note, target, HealthCheck, Phase
)
from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, rule, invariant, initialize, precondition

def clamp(x: int, lo: int, hi: int) -> int:
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x

def normalize_whitespace(s: str) -> str:
    return " ".join(s.split())

def is_sorted_non_decreasing(xs):
    return all(xs[i] <= xs[i+1] for i in range(len(xs)-1))

def merge_sorted(a, b):
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def merge_sorted_reference(a, b):
    return sorted(list(a) + list(b))

We set up the environment by installing Hypothesis and pytest and importing all required modules. We begin constructing the full test suite by defining core utility functions such as clamp, normalize_whitespace, and merge_sorted. We establish the functional foundation that our property-based tests will rigorously validate in later snippets.
def safe_parse_int(s: str):
    t = s.strip()
    if re.fullmatch(r"[+-]?\d+", t) is None:
        return (False, "not_an_int")
    if len(t.lstrip("+-")) > 2000:
        return (False, "too_big")
    try:
        return (True, int(t))
    except Exception:
        return (False, "parse_error")

def safe_parse_int_alt(s: str):
    t = s.strip()
    if not t:
        return (False, "not_an_int")
    sign = 1
    if t[0] == "+":
        t = t[1:]
    elif t[0] == "-":
        sign = -1
        t = t[1:]
    if not t or any(ch < "0" or ch > "9" for ch in t):
        return (False, "not_an_int")
    if len(t) > 2000:
        return (False, "too_big")
    val = 0
    for ch in t:
        val = val * 10 + (ord(ch) - 48)
    return (True, sign * val)

bounds = st.tuples(st.integers(-10_000, 10_000), st.integers(-10_000, 10_000)).map(
    lambda t: (t[0], t[1]) if t[0] <= t[1] else (t[1], t[0])
)

@st.composite
def int_like_strings(draw):
    sign = draw(st.sampled_from(["", "+", "-"]))
    digits = draw(st.text(alphabet=st.characters(min_codepoint=48, max_codepoint=57),
                          min_size=1, max_size=300))
    left_ws = draw(st.text(alphabet=[" ", "\t", "\n"], min_size=0, max_size=5))
    right_ws = draw(st.text(alphabet=[" ", "\t", "\n"], min_size=0, max_size=5))
    return f"{left_ws}{sign}{digits}{right_ws}"

sorted_lists = st.lists(st.integers(-10_000, 10_000), min_size=0, max_size=200).map(sorted)

We implement parsing logic and define structured strategies that generate constrained, meaningful test inputs. We create composite strategies such as int_like_strings to precisely control the input space for property validation. We prepare sorted list generators and bounds strategies that enable differential and invariant-based testing.
@settings(max_examples=300, suppress_health_check=[HealthCheck.too_slow])
@given(x=st.integers(-50_000, 50_000), b=bounds)
def test_clamp_within_bounds(x, b):
    lo, hi = b
    y = clamp(x, lo, hi)
    assert lo <= y <= hi

@settings(max_examples=300, suppress_health_check=[HealthCheck.too_slow])
@given(x=st.integers(-50_000, 50_000), b=bounds)
def test_clamp_idempotent(x, b):
    lo, hi = b
    y = clamp(x, lo, hi)
    assert clamp(y, lo, hi) == y

@settings(max_examples=250)
@given(s=st.text())
@example(" a\t\tb \n c ")
def test_normalize_whitespace_is_idempotent(s):
    t = normalize_whitespace(s)
    assert normalize_whitespace(t) == t
    assert normalize_whitespace(" \n\t " + s + " \t") == normalize_whitespace(s)

@settings(max_examples=250, suppress_health_check=[HealthCheck.too_slow])
@given(a=sorted_lists, b=sorted_lists)
def test_merge_sorted_matches_reference(a, b):
    out = merge_sorted(a, b)
    ref = merge_sorted_reference(a, b)
    assert out == ref
    assert is_sorted_non_decreasing(out)

We define core property tests that validate correctness and idempotence across multiple functions. We use Hypothesis decorators to automatically explore edge cases and verify behavioral guarantees such as boundary constraints and deterministic normalization. We also implement differential testing to ensure our merge implementation matches a trusted reference.
@settings(max_examples=250, deadline=200, suppress_health_check=[HealthCheck.too_slow])
@given(s=int_like_strings())
def test_two_parsers_agree_on_int_like_strings(s):
    ok1, v1 = safe_parse_int(s)
    ok2, v2 = safe_parse_int_alt(s)
    assert ok1 and ok2
    assert v1 == v2

@settings(max_examples=250)
@given(s=st.text(min_size=0, max_size=200))
def test_safe_parse_int_rejects_non_ints(s):
    t = s.strip()
    m = re.fullmatch(r"[+-]?\d+", t)
    ok, val = safe_parse_int(s)
    if m is None:
        assert ok is False
    else:
        if len(t.lstrip("+-")) > 2000:
            assert ok is False and val == "too_big"
        else:
            assert ok is True and isinstance(val, int)

def variance(xs):
    if len(xs) < 2:
        return 0.0
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)

@settings(max_examples=250, phases=[Phase.generate, Phase.shrink])
@given(xs=st.lists(st.integers(-1000, 1000), min_size=0, max_size=80))
def test_statistics_sanity(xs):
    target(variance(xs))
    if len(xs) == 0:
        assert variance(xs) == 0.0
    elif len(xs) == 1:
        assert variance(xs) == 0.0
    else:
        v = variance(xs)
        assert v >= 0.0
        k = 7
        assert math.isclose(variance([x + k for x in xs]), v, rel_tol=1e-12, abs_tol=1e-12)

We extend our validation to parsing robustness and statistical correctness using targeted exploration. We verify that two independent integer parsers agree on structured inputs and enforce rejection rules on invalid strings. We further implement metamorphic testing by validating invariants of variance under transformation.
class Bank:
    def __init__(self):
        self.balance = 0
        self.ledger = []

    def deposit(self, amt: int):
        if amt <= 0:
            raise ValueError("deposit must be positive")
        self.balance += amt
        self.ledger.append(("dep", amt))

    def withdraw(self, amt: int):
        if amt <= 0:
            raise ValueError("withdraw must be positive")
        if amt > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amt
        self.ledger.append(("wd", amt))

    def replay_balance(self):
        bal = 0
        for typ, amt in self.ledger:
            bal += amt if typ == "dep" else -amt
        return bal

class BankMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.bank = Bank()

    @initialize()
    def init(self):
        assert self.bank.balance == 0
        assert self.bank.replay_balance() == 0

    @rule(amt=st.integers(min_value=1, max_value=10_000))
    def deposit(self, amt):
        self.bank.deposit(amt)

    @precondition(lambda self: self.bank.balance > 0)
    @rule(amt=st.integers(min_value=1, max_value=10_000))
    def withdraw(self, amt):
        assume(amt <= self.bank.balance)
        self.bank.withdraw(amt)

    @invariant()
    def balance_never_negative(self):
        assert self.bank.balance >= 0

    @invariant()
    def ledger_replay_matches_balance(self):
        assert self.bank.replay_balance() == self.bank.balance

TestBankMachine = BankMachine.TestCase
'''

path = "/tmp/test_hypothesis_advanced.py"
with open(path, "w", encoding="utf-8") as f:
    f.write(test_code)

print("Hypothesis version:", __import__("hypothesis").__version__)
print("\nRunning pytest on:", path, "\n")
res = subprocess.run([sys.executable, "-m", "pytest", "-q", path],
                     capture_output=True, text=True)
print(res.stdout)
if res.returncode != 0:
    print(res.stderr)
if res.returncode == 0:
    print("\nAll Hypothesis tests passed.")
elif res.returncode == 5: