YouZum

Committee

AI, Committee, Notizie, Uncategorized

The data center boom in the desert

In the high desert east of Reno, Nevada, construction crews are flattening the golden foothills of the Virginia Range, laying the foundations of a data center city. Google, Tract, Switch, EdgeCore, Novva, Vantage, and PowerHouse are all operating, building, or expanding huge facilities within the Tahoe Reno Industrial Center, a business park bigger than the city of Detroit.  This story is a part of MIT Technology Review’s series “Power Hungry: AI and our energy future,” on the energy demands and carbon costs of the artificial-intelligence revolution. Meanwhile, Microsoft acquired more than 225 acres of undeveloped property within the center and an even larger plot in nearby Silver Springs, Nevada. Apple is expanding its data center, located just across the Truckee River from the industrial park. OpenAI has said it’s considering building a data center in Nevada as well. The corporate race to amass computing resources to train and run artificial intelligence models and store information in the cloud has sparked a data center boom in the desert—just far enough away from Nevada’s communities to elude wide notice and, some fear, adequate scrutiny.  Switch, a data center company based in Las Vegas, says the full build-out of its campus at the Tahoe Reno Industrial Center could exceed seven million square feet.EMILY NAJERA The full scale and potential environmental impacts of the developments aren’t known, because the footprint, energy needs, and water requirements are often closely guarded corporate secrets. Most of the companies didn’t respond to inquiries from MIT Technology Review, or declined to provide additional information about the projects.  But there’s “a whole lot of construction going on,” says Kris Thompson, who served as the longtime project manager for the industrial center before stepping down late last year. “The last number I heard was 13 million square feet under construction right now, which is massive.” Indeed, it’s the equivalent of almost five Empire State Buildings laid out flat. In addition, public filings from NV Energy, the state’s near-monopoly utility, reveal that a dozen data-center projects, mostly in this area, have requested nearly six gigawatts of electricity capacity within the next decade.  That would make the greater Reno area—the biggest little city in the world—one of the largest data-center markets around the globe. It would also require expanding the state’s power sector by about 40%, all for a single industry in an explosive growth stage that may, or may not, prove sustainable. The energy needs, in turn, suggest those projects could consume billions of gallons of water per year, according to an analysis conducted for this story.  Construction crews are busy building data centers throughout the Tahoe Reno Industrial Center.EMILY NAJERA The build-out of a dense cluster of energy and water-hungry data centers in a small stretch of the nation’s driest state, where climate change is driving up temperatures faster than anywhere else in the country, has begun to raise alarms among water experts, environmental groups, and residents. That includes members of the Pyramid Lake Paiute Tribe, whose namesake water body lies within their reservation and marks the end point of the Truckee River, the region’s main source of water. Much of Nevada has suffered through severe drought conditions for years, farmers and communities are drawing down many of the state’s groundwater reservoirs faster than they can be refilled, and global warming is sucking more and more moisture out of the region’s streams, shrubs, and soils. “Telling entities that they can come in and stick more straws in the ground for data centers is raising a lot of questions about sound management,” says Kyle Roerink, executive director of the Great Basin Water Network, a nonprofit that works to protect water resources throughout Nevada and Utah.  “We just don’t want to be in a situation where the tail is wagging the dog,” he later added, “where this demand for data centers is driving water policy.” Luring data centers In the late 1850s, the mountains southeast of Reno began enticing prospectors from across the country, who hoped to strike silver or gold in the famed Comstock Lode. But Storey County had few residents or economic prospects by the late 1990s, around the time when Don Roger Norman, a media-shy real estate speculator, spotted a new opportunity in the sagebrush-covered hills.  He began buying up tens of thousands of acres of land for tens of millions of dollars and lining up development approvals to lure industrial projects to what became the Tahoe Reno Industrial Center. His partners included Lance Gilman, a cowboy-hat-wearing real estate broker, who later bought the nearby Mustang Ranch brothel and won a seat as a county commissioner. In 1999, the county passed an ordinance that preapproves companies to develop most types of commercial and industrial projects across the business park, cutting months to years off the development process. That helped cinch deals with a flock of tenants looking to build big projects fast, including Walmart, Tesla, and Redwood Materials. Now the promise of fast permits is helping to draw data centers by the gigawatt. On a clear, cool January afternoon, Brian Armon, a commercial real estate broker who leads the industrial practices group at NAI Alliance, takes me on a tour of the projects around the region, which mostly entails driving around the business center. Lance Gilman, a local real estate broker, helped to develop the Tahoe Reno Industrial Center and land some of its largest tenants.GREGG SEGAL After pulling off Interstate 80 onto USA Parkway, he points out the cranes, earthmovers, and riprap foundations, where a variety of data centers are under construction. Deeper into the industrial park, Armon pulls up near Switch’s long, low, arched-roof facility, which sits on a terrace above cement walls and security gates. The Las Vegas–based company says the first phase of its data center campus encompasses more than a million square feet, and that the full build-out will cover seven times that space.  Over the next hill, we turn around in Google’s parking lot. Cranes, tents, framing, and construction equipment extend behind the

The data center boom in the desert Leggi l'articolo »

AI, Committee, Notizie, Uncategorized

Four reasons to be optimistic about AI’s energy usage

The day after his inauguration in January, President Donald Trump announced Stargate, a $500 billion initiative to build out AI infrastructure, backed by some of the biggest companies in tech. Stargate aims to accelerate the construction of massive data centers and electricity networks across the US to ensure it keeps its edge over China. This story is a part of MIT Technology Review’s series “Power Hungry: AI and our energy future,” on the energy demands and carbon costs of the artificial-intelligence revolution. The whatever-it-takes approach to the race for worldwide AI dominance was the talk of Davos, says Raquel Urtasun, founder and CEO of the Canadian robotruck startup Waabi, referring to the World Economic Forum’s annual January meeting in Switzerland, which was held the same week as Trump’s announcement. “I’m pretty worried about where the industry is going,” Urtasun says.  She’s not alone. “Dollars are being invested, GPUs are being burned, water is being evaporated—it’s just absolutely the wrong direction,” says Ali Farhadi, CEO of the Seattle-based nonprofit Allen Institute for AI. But sift through the talk of rocketing costs—and climate impact—and you’ll find reasons to be hopeful. There are innovations underway that could improve the efficiency of the software behind AI models, the computer chips those models run on, and the data centers where those chips hum around the clock. Here’s what you need to know about how energy use, and therefore carbon emissions, could be cut across all three of those domains, plus an added argument for cautious optimism: There are reasons to believe that the underlying business realities will ultimately bend toward more energy-efficient AI. 1/ More efficient models The most obvious place to start is with the models themselves—the way they’re created and the way they’re run. AI models are built by training neural networks on lots and lots of data. Large language models are trained on vast amounts of text, self-driving models are trained on vast amounts of driving data, and so on. But the way such data is collected is often indiscriminate. Large language models are trained on data sets that include text scraped from most of the internet and huge libraries of scanned books. The practice has been to grab everything that’s not nailed down, throw it into the mix, and see what comes out. This approach has certainly worked, but training a model on a massive data set over and over so it can extract relevant patterns by itself is a waste of time and energy. There might be a more efficient way. Children aren’t expected to learn just by reading everything that’s ever been written; they are given a focused curriculum. Urtasun thinks we should do something similar with AI, training models with more curated data tailored to specific tasks. (Waabi trains its robotrucks inside a superrealistic simulation that allows fine-grained control of the virtual data its models are presented with.) It’s not just Waabi. Writer, an AI startup that builds large language models for enterprise customers, claims that its models are cheaper to train and run in part because it trains them using synthetic data. Feeding its models bespoke data sets rather than larger but less curated ones makes the training process quicker (and therefore less expensive). For example, instead of simply downloading Wikipedia, the team at Writer takes individual Wikipedia pages and rewrites their contents in different formats—as a Q&A instead of a block of text, and so on—so that its models can learn more from less. Training is just the start of a model’s life cycle. As models have become bigger, they have become more expensive to run. So-called reasoning models that work through a query step by step before producing a response are especially power-hungry because they compute a series of intermediate subresponses for each response. The price tag of these new capabilities is eye-watering: OpenAI’s o3 reasoning model has been estimated to cost up to $30,000 per task to run.   But this technology is only a few months old and still experimental. Farhadi expects that these costs will soon come down. For example, engineers will figure out how to stop reasoning models from going too far down a dead-end path before they determine it’s not viable. “The first time you do something it’s way more expensive, and then you figure out how to make it smaller and more efficient,” says Farhadi. “It’s a fairly consistent trend in technology.” One way to get performance gains without big jumps in energy consumption is to run inference steps (the computations a model makes to come up with its response) in parallel, he says. Parallel computing underpins much of today’s software, especially large language models (GPUs are parallel by design). Even so, the basic technique could be applied to a wider range of problems. By splitting up a task and running different parts of it at the same time, parallel computing can generate results more quickly. It can also save energy by making more efficient use of available hardware. But it requires clever new algorithms to coordinate the multiple subtasks and pull them together into a single result at the end.  The largest, most powerful models won’t be used all the time, either. There is a lot of talk about small models, versions of large language models that have been distilled into pocket-size packages. In many cases, these more efficient models perform as well as larger ones, especially for specific use cases. As businesses figure out how large language models fit their needs (or not), this trend toward more efficient bespoke models is taking off. You don’t need an all-purpose LLM to manage inventory or to respond to niche customer queries. “There’s going to be a really, really large number of specialized models, not one God-given model that solves everything,” says Farhadi. Christina Shim, chief sustainability officer at IBM, is seeing this trend play out in the way her clients adopt the technology. She works with businesses to make sure they choose the smallest and least power-hungry models possible.

Four reasons to be optimistic about AI’s energy usage Leggi l'articolo »

AI, Committee, Notizie, Uncategorized

Can nuclear power really fuel the rise of AI?

In the AI arms race, all the major players say they want to go nuclear.   Over the past year, the likes of Meta, Amazon, Microsoft, and Google have sent out a flurry of announcements related to nuclear energy. Some are about agreements to purchase power from existing plants, while others are about investments looking to boost unproven advanced technologies. This story is a part of MIT Technology Review’s series “Power Hungry: AI and our energy future,” on the energy demands and carbon costs of the artificial-intelligence revolution. These somewhat unlikely partnerships could be a win for both the nuclear power industry and large tech companies. Tech giants need guaranteed sources of energy, and many are looking for low-emissions ones to hit their climate goals. For nuclear plant operators and nuclear technology developers, the financial support of massive established customers could help keep old nuclear power plants open and push new technologies forward. “There [are] a lot of advantages to nuclear,” says Michael Terrell, senior director of clean energy and carbon reduction at Google. Among them, he says, are that it’s “clean, firm, carbon-free, and can be sited just about anywhere.” (Firm energy sources are those that provide constant power.)  But there’s one glaring potential roadblock: timing. “There are needs on different time scales,” says Patrick White, former research director at the Nuclear Innovation Alliance. Many of these tech companies will require large amounts of power in the next three to five years, White says, but building new nuclear plants can take close to a decade.  Some next-generation nuclear technologies, especially small modular reactors, could take less time to build, but the companies promising speed have yet to build their first reactors—and in some cases they are still years away from even modestly sized demonstrations.  This timing mismatch means that even as tech companies tout plans for nuclear power, they’ll actually be relying largely on fossil fuels, keeping coal plants open, and even building new natural gas plants that could stay open for decades. AI and nuclear could genuinely help each other grow, but the reality is that the growth could be much slower than headlines suggest.  AI’s need for speed The US alone has roughly 3,000 data centers, and current projections say the AI boom could add thousands more by the end of the decade. The rush could increase global data center power demand by as much as 165% by 2030, according to one recent analysis from Goldman Sachs. In the US, estimates from industry and academia suggest energy demand for data centers could be as high as 400 terawatt-hours by 2030—up from fewer than 100 terawatt-hours in 2020 and higher than the total electricity demand from the entire country of Mexico. There are indications that the data center boom might be decelerating, with some companies slowing or pausing some projects in recent weeks. But even the most measured projections, in analyses like one recent report from the International Energy Agency, predict that energy demand will increase. The only question is by how much.   Many of the same tech giants currently scrambling to build data centers have also set climate goals, vowing to reach net-zero emissions or carbon-free energy within the next couple of decades. So they have a vested interest in where that electricity comes from.  Nuclear power has emerged as a strong candidate for companies looking to power data centers while cutting emissions. Unlike wind turbines and solar arrays that generate electricity intermittently, nuclear power plants typically put out a constant supply of energy to the grid, which aligns well with what data centers need. “Data center companies pretty much want to run full out, 24/7,” says Rob Gramlich, president of Grid Strategies, a consultancy focused on electricity and transmission. It also doesn’t hurt that, while renewables are increasingly politicized and under attack by the current administration in the US, nuclear has broad support on both sides of the aisle.  The problem is how to build up nuclear capacity—existing facilities are limited, and new technologies will take time to build. In 2022, all the nuclear reactors in the US together provided around 800 terawatt-hours of electricity to the power grid, a number that’s been basically steady for the past two decades. To meet electricity demand from data centers expected in 2030 with nuclear power, we’d need to expand the fleet of reactors in the country by half. New nuclear news  Some of the most exciting headlines regarding the burgeoning relationship between AI and nuclear technology involve large, established companies jumping in to support innovations that could bring nuclear power into the 21st century.  In October 2024, Google signed a deal with Kairos Power, a next-generation nuclear company that recently received construction approval for two demonstration reactors from the US Nuclear Regulatory Commission (NRC). The company is working to build small, molten-salt-cooled reactors, which it says will be safer and more efficient than conventional technology. The Google deal is a long-term power-purchase agreement: The tech giant will buy up to 500 megawatts of electricity by 2035 from whatever plants Kairos manages to build, with the first one scheduled to come online by 2030.  Amazon is also getting involved with next-generation nuclear technology with a direct investment in Maryland-based X-energy. The startup is among those working to create smaller, more-standardized reactors that can be built more quickly and with less expense. In October, Amazon signed a deal with Energy Northwest, a utility in Washington state, that will see Amazon fund the initial phase of a planned X-energy small modular reactor project in the state. The tech giant will have a right to buy electricity from one of the modules in the first project, which could generate 320 megawatts of electricity and be expanded to generate as much as 960 megawatts. Many new AI-focused data centers under construction will require 500 megawatts of power or more, so this project might be just large enough to power a single site.  The project will help meet energy needs “beginning in the early

Can nuclear power really fuel the rise of AI? Leggi l'articolo »

AI, Committee, Notizie, Uncategorized

Inside the story that enraged OpenAI

In 2019, Karen Hao, a senior reporter with MIT Technology Review, pitched me on writing a story about a then little-known company, OpenAI. It was her biggest assignment to date. Hao’s feat of reporting took a series of twists and turns over the coming months, eventually revealing how OpenAI’s ambition had taken it far afield from its original mission. The finished story was a prescient look at a company at a tipping point—or already past it. And OpenAI was not happy with the result. Hao’s new book, Empire of AI: Dreams and Nightmares in Sam Altman’s OpenAI, is an in-depth exploration of the company that kick-started the AI arms race, and what that race means for all of us. This excerpt is the origin story of that reporting. — Niall Firth, executive editor, MIT Technology Review I arrived at OpenAI’s offices on August 7, 2019. Greg Brockman, then thirty‑one, OpenAI’s chief technology officer and soon‑to‑be company president, came down the staircase to greet me. He shook my hand with a tentative smile. “We’ve never given someone so much access before,” he said. At the time, few people beyond the insular world of AI research knew about OpenAI. But as a reporter at MIT Technology Review covering the ever‑expanding boundaries of artificial intelligence, I had been following its movements closely. Until that year, OpenAI had been something of a stepchild in AI research. It had an outlandish premise that AGI could be attained within a decade, when most non‑OpenAI experts doubted it could be attained at all. To much of the field, it had an obscene amount of funding despite little direction and spent too much of the money on marketing what other researchers frequently snubbed as unoriginal research. It was, for some, also an object of envy. As a nonprofit, it had said that it had no intention to chase commercialization. It was a rare intellectual playground without strings attached, a haven for fringe ideas. But in the six months leading up to my visit, the rapid slew of changes at OpenAI signaled a major shift in its trajectory. First was its confusing decision to withhold GPT‑2 and brag about it. Then its announcement that Sam Altman, who had mysteriously departed his influential perch at YC, would step in as OpenAI’s CEO with the creation of its new “capped‑profit” structure. I had already made my arrangements to visit the office when it subsequently revealed its deal with Microsoft, which gave the tech giant priority for commercializing OpenAI’s technologies and locked it into exclusively using Azure, Microsoft’s cloud‑computing platform. Each new announcement garnered fresh controversy, intense speculation, and growing attention, beginning to reach beyond the confines of the tech industry. As my colleagues and I covered the company’s progression, it was hard to grasp the full weight of what was happening. What was clear was that OpenAI was beginning to exert meaningful sway over AI research and the way policymakers were learning to understand the technology. The lab’s decision to revamp itself into a partially for‑profit business would have ripple effects across its spheres of influence in industry and government.  So late one night, with the urging of my editor, I dashed off an email to Jack Clark, OpenAI’s policy director, whom I had spoken with before: I would be in town for two weeks, and it felt like the right moment in OpenAI’s history. Could I interest them in a profile? Clark passed me on to the communications head, who came back with an answer. OpenAI was indeed ready to reintroduce itself to the public. I would have three days to interview leadership and embed inside the company. Brockman and I settled into a glass meeting room with the company’s chief scientist, Ilya Sutskever. Sitting side by side at a long conference table, they each played their part. Brockman, the coder and doer, leaned forward, a little on edge, ready to make a good impression; Sutskever, the researcher and philosopher, settled back into his chair, relaxed and aloof. I opened my laptop and scrolled through my questions. OpenAI’s mission is to ensure beneficial AGI, I began. Why spend billions of dollars on this problem and not something else? Brockman nodded vigorously. He was used to defending OpenAI’s position. “The reason that we care so much about AGI and that we think it’s important to build is because we think it can help solve complex problems that are just out of reach of humans,” he said. He offered two examples that had become dogma among AGI believers. Climate change. “It’s a super‑complex problem. How are you even supposed to solve it?” And medicine. “Look at how important health care is in the US as a political issue these days. How do we actually get better treatment for people at lower cost?” On the latter, he began to recount the story of a friend who had a rare disorder and had recently gone through the exhausting rigmarole of bouncing between different specialists to figure out his problem. AGI would bring together all of these specialties. People like his friend would no longer spend so much energy and frustration on getting an answer. Why did we need AGI to do that instead of AI? I asked. This was an important distinction. The term AGI, once relegated to an unpopular section of the technology dictionary, had only recently begun to gain more mainstream usage—in large part because of OpenAI. And as OpenAI defined it, AGI referred to a theoretical pinnacle of AI research: a piece of software that had just as much sophistication, agility, and creativity as the human mind to match or exceed its performance on most (economically valuable) tasks. The operative word was theoretical. Since the beginning of earnest research into AI several decades earlier, debates had raged about whether silicon chips encoding everything in their binary ones and zeros could ever simulate brains and the other biological processes that give rise to what we consider intelligence. There had yet to

Inside the story that enraged OpenAI Leggi l'articolo »

AI, Committee, Notizie, Uncategorized

Can crowdsourced fact-checking curb misinformation on social media?

In a 2019 speech at Georgetown University, Mark Zuckerberg famously declared that he didn’t want Facebook to be an “arbiter of truth.” And yet, in the years since, his company, Meta, has used several methods to moderate content and identify misleading posts across its social media apps, which include Facebook, Instagram, and Threads. These methods have included automatic filters that identify illegal and malicious content, and third-party factcheckers who manually research the validity of claims made in certain posts. Zuckerberg explained that while Meta has put a lot of effort into building “complex systems to moderate content,” over the years, these systems have made many mistakes, with the result being “too much censorship.” The company therefore announced that it would be ending its third-party factchecker program in the US, replacing it with a system called Community Notes, which relies on users to flag false or misleading content and provide context about it. While Community Notes has the potential to be extremely effective, the difficult job of content moderation benefits from a mix of different approaches. As a professor of natural language processing at MBZUAI, I’ve spent most of my career researching disinformation, propaganda, and fake news online. So, one of the first questions I asked myself was: will replacing human factcheckers with crowdsourced Community Notes have negative impacts on users? Wisdom of crowds Community Notes got its start on Twitter as Birdwatch. It’s a crowdsourced feature where users who participate in the program can add context and clarification to what they deem false or misleading tweets. The notes are hidden until community evaluation reaches a consensus—meaning, people who hold different perspectives and political views agree that a post is misleading. An algorithm determines when the threshold for consensus is reached, and then the note becomes publicly visible beneath the tweet in question, providing additional context to help users make informed judgments about its content. Community Notes seems to work rather well. A team of researchers from University of Illinois Urbana-Champaign and University of Rochester found that X’s Community Notes program can reduce the spread of misinformation, leading to post retractions by authors. Facebook is largely adopting the same approach that is used on X today. Having studied and written about content moderation for years, it’s great to see another major social media company implementing crowdsourcing for content moderation. If it works for Meta, it could be a true game-changer for the more than 3 billion people who use the company’s products every day. That said, content moderation is a complex problem. There is no one silver bullet that will work in all situations. The challenge can only be addressed by employing a variety of tools that include human factcheckers, crowdsourcing, and algorithmic filtering. Each of these is best suited to different kinds of content, and can and must work in concert. Spam and LLM safety There are precedents for addressing similar problems. Decades ago, spam email was a much bigger problem than it is today. In large part, we’ve defeated spam through crowdsourcing. Email providers introduced reporting features, where users can flag suspicious emails. The more widely distributed a particular spam message is, the more likely it will be caught, as it’s reported by more people. Another useful comparison is how large language models (LLMs) approach harmful content. For the most dangerous queries—related to weapons or violence, for example—many LLMs simply refuse to answer. Other times, these systems may add a disclaimer to their outputs, such as when they are asked to provide medical, legal, or financial advice. This tiered approach is one that my colleagues and I at the MBZUAI explored in a recent study where we propose a hierarchy of ways LLMs can respond to different kinds of potentially harmful queries. Similarly, social media platforms can benefit from different approaches to content moderation. Automatic filters can be used to identify the most dangerous information, preventing users from seeing and sharing it. These automated systems are fast, but they can only be used for certain kinds of content because they aren’t capable of the nuance required for most content moderation. Crowdsourced approaches like Community Notes can flag potentially harmful content by relying on the knowledge of users. They are slower than automated systems but faster than professional factcheckers. Professional factcheckers take the most time to do their work, but the analyses they provide are deeper compared to Community Notes, which are limited to 500 characters. Factcheckers typically work as a team and benefit from shared knowledge. They are often trained to analyze the logical structure of arguments, identifying rhetorical techniques frequently employed in mis- and disinformation campaigns. But the work of professional factcheckers can’t scale in the same way Community Notes can. That’s why these three methods are most effective when they are used together. Indeed, Community Notes have been found to amplify the work done by factcheckers so it reaches more users. Another study found that Community Notes and factchecking complement each other, as they focus on different types of accounts, with Community Notes tending to analyze posts from large accounts that have high “social influence.” When Community Notes and factcheckers do converge on the same posts, their assessments are similar, however. Another study found that crowdsourced content moderation itself benefits from the findings of professional factcheckers. A path forward At its heart, content moderation is extremely difficult because it is about how we determine truth—and there is much we don’t know. Even scientific consensus, built over years by entire disciplines, can change over time. That said, platforms shouldn’t retreat from the difficult task of moderating content altogether—or become overly dependent on any single solution. They must continuously experiment, learn from their failures, and refine their strategies. As it’s been said, the difference between people who succeed and people who fail is that successful people have failed more times than others have even tried. This content was produced by the Mohamed bin Zayed University of Artificial Intelligence. It was not written by MIT Technology Review’s editorial staff.

Can crowdsourced fact-checking curb misinformation on social media? Leggi l'articolo »

AI, Committee, Notizie, Uncategorized

LLMs Struggle with Real Conversations: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop in Multi-Turn Underspecified Tasks

Conversational artificial intelligence is centered on enabling large language models (LLMs) to engage in dynamic interactions where user needs are revealed progressively. These systems are widely deployed in tools that assist with coding, writing, and research by interpreting and responding to natural language instructions. The aspiration is for these models to flexibly adjust to changing user inputs over multiple turns, adapting their understanding with each new piece of information. This contrasts with static, single-turn responses and highlights a major design goal: sustaining contextual coherence and delivering accurate outcomes in extended dialogues. A persistent problem in conversational AI is the model’s inability to handle user instructions distributed across multiple conversation turns. Rather than receiving all necessary information simultaneously, LLMs must extract and integrate key details incrementally. However, when the task is not specified upfront, models tend to make early assumptions about what is being asked and attempt final solutions prematurely. This leads to errors that persist through the conversation, as the models often stick to their earlier interpretations. The result is that once an LLM makes a misstep in understanding, it struggles to recover, resulting in incomplete or misguided answers. Most current tools evaluate LLMs using single-turn, fully-specified prompts, where all task requirements are presented in one go. Even in research claiming multi-turn analysis, the conversations are typically episodic, treated as isolated subtasks rather than an evolving flow. These evaluations fail to account for how models behave when the information is fragmented and context must be actively constructed from multiple exchanges. Consequently, evaluations often miss the core difficulty models face: integrating underspecified inputs over several conversational turns without explicit direction. Researchers from Microsoft Research and Salesforce Research introduced a simulation setup that mimics how users reveal information in real conversations. Their “sharded simulation” method takes complete instructions from high-quality benchmarks and splits them into smaller, logically connected parts or “shards.” Each shard delivers a single element of the original instruction, which is then revealed sequentially over multiple turns. This simulates the progressive disclosure of information that happens in practice. The setup includes a simulated user powered by an LLM that decides which shard to reveal next and reformulates it naturally to fit the ongoing context. This setup also uses classification mechanisms to evaluate whether the assistant’s responses attempt a solution or require clarification, further refining the simulation of genuine interaction. The technology developed simulates five types of conversations, including single-turn full instructions and multiple multi-turn setups. In SHARDED simulations, LLMs received instructions one shard at a time, forcing them to wait before proposing a complete answer. This setup evaluated 15 LLMs across six generation tasks: coding, SQL queries, API actions, math problems, data-to-text descriptions, and document summaries. Each task drew from established datasets such as GSM8K, Spider, and ToTTo. For every LLM and instruction, 10 simulations were conducted, totaling over 200,000 simulations. Aptitude, unreliability, and average performance were computed using a percentile-based scoring system, allowing direct comparison of best and worst-case outcomes per model. Across all tasks and models, a consistent decline in performance was observed in the SHARDED setting. On average, performance dropped from 90% in single-turn to 65% in multi-turn scenarios—a 25-point decline. The main cause was not reduced capability but a dramatic rise in unreliability. While aptitude dropped by 16%, unreliability increased by 112%, revealing that models varied wildly in how they performed when information was presented gradually. For example, even top-performing models like GPT-4.1 and Gemini 2.5 Pro exhibited 30-40% average degradations. Additional compute at generation time or lowering randomness (temperature settings) offered only minor improvements in consistency. This research clarifies that even state-of-the-art LLMs are not yet equipped to manage complex conversations where task requirements unfold gradually. The sharded simulation methodology effectively exposes how models falter in adapting to evolving instructions, highlighting the urgent need to improve reliability in multi-turn settings. Enhancing the ability of LLMs to process incomplete instructions over time is essential for real-world applications where conversations are naturally unstructured and incremental. Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit. The post LLMs Struggle with Real Conversations: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop in Multi-Turn Underspecified Tasks appeared first on MarkTechPost.

LLMs Struggle with Real Conversations: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop in Multi-Turn Underspecified Tasks Leggi l'articolo »

AI, Committee, Notizie, Uncategorized

This AI paper from DeepSeek-AI Explores How DeepSeek-V3 Delivers High-Performance Language Modeling by Minimizing Hardware Overhead and Maximizing Computational Efficiency

The growth in developing and deploying large language models (LLMs) is closely tied to architectural innovations, large-scale datasets, and hardware improvements. Models like DeepSeek-V3, GPT-4o, Claude 3.5 Sonnet, and LLaMA-3 have demonstrated how scaling enhances reasoning and dialogue capabilities. However, as their performance increases, so do computing, memory, and communication bandwidth demands, placing substantial strain on hardware. Without parallel progress in model and infrastructure co-design, these models risk becoming accessible only to organizations with massive resources. This makes optimizing training cost, inference speed, and memory efficiency a critical area of research. A core challenge is the mismatch between model size and hardware capabilities. LLM memory consumption grows over 1000% annually, while high-speed memory bandwidth increases by less than 50%. During inference, caching prior context in Key-Value (KV) stores adds to memory strain and slows processing. Dense models activate all parameters per token, escalating computational costs, particularly for models with hundreds of billions of parameters. This results in billions of floating-point operations per token and high energy demands. Time Per Output Token (TPOT), a key performance metric, also suffers, impacting user experience. These problems call for solutions beyond simply adding more hardware. Techniques like Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce memory usage by sharing attention weights. Windowed KV caching lowers memory use by storing only recent tokens, but can limit long-context understanding. Quantized compression with low-bit formats like 4-bit and 8-bit cuts memory further, though sometimes with trade-offs in accuracy. Precision formats such as BF16 and FP8 improve training speed and efficiency. While useful, these techniques often tackle individual issues rather than a comprehensive solution to scaling challenges. Researchers from DeepSeek-AI introduced a more integrated and efficient strategy with the development of DeepSeek-V3, designed to scale intelligently rather than excessively. Utilizing 2,048 NVIDIA H800 GPUs, the model achieves state-of-the-art performance while focusing on cost-efficiency. Instead of depending on expansive infrastructure, the team engineered the model architecture to work harmoniously with hardware constraints. Central to this effort are innovations such as Multi-head Latent Attention (MLA) for memory optimization, a Mixture of Experts (MoE) framework for computational efficiency, and FP8 mixed-precision training to accelerate performance without sacrificing accuracy. A custom Multi-Plane Network Topology was also employed to minimize inter-device communication overhead. Collectively, these components make DeepSeek-V3 a scalable and accessible solution, capable of rivaling much larger systems while operating on significantly leaner resources. The architecture achieves memory efficiency by reducing the KV cache requirement per token to just 70 KB using MLA, compared to 327 KB and 516 KB in Qwen-2.5 and LLaMA-3.1, respectively. This reduction is accomplished by compressing attention heads into a smaller latent vector jointly trained with the model. Computational efficiency is further boosted with the MoE model, which increases total parameters to 671 billion but only activates 37 billion per token. This contrasts sharply with dense models that require full parameter activation. For example, LLaMA-3.1 needs 2,448 GFLOPS per token, while DeepSeek-V3 operates at just 250 GFLOPS. Also, the architecture integrates a Multi-Token Prediction (MTP) module, enabling the generation of multiple tokens in a single step. The system achieves up to 1.8x improvement in generation speed, and real-world measurements show 80-90% token acceptance for speculative decoding. Using a system interconnected by CX7 400 Gbps InfiniBand NICs, DeepSeek-V3 achieves a theoretical TPOT of 14.76 milliseconds, equal to 67 tokens per second. With higher-bandwidth setups like NVIDIA GB200 NVL72 offering 900 GB/s, this number can be reduced to 0.82 milliseconds TPOT, potentially achieving 1,200 tokens per second. The practical throughput is lower due to compute-communication overlap and memory limitations, but the framework lays the foundation for future high-speed implementations. FP8 precision further adds to the speed gains. The training framework applies tile-wise 1×128 and block-wise 128×128 quantization, with less than 0.25% accuracy loss compared to BF16. These results were validated on smaller 16B and 230B parameter versions before integration into the 671B model. Several key takeaways from the research on insights into DeepSeek-V3 include: MLA compression reduces KV cache size per token from 516 KB to 70 KB, significantly lowering memory demands during inference. Only 37 billion of the 671 billion total parameters are activated per token, dramatically reducing compute and memory requirements without compromising model performance. DeepSeek-V3 requires just 250 GFLOPS per token, compared to 2,448 GFLOPS for dense models like LLaMA-3.1, highlighting its computational efficiency. Achieves up to 67 tokens per second (TPS) on a 400 Gbps InfiniBand network, with the potential to scale to 1,200 TPS using advanced interconnects like NVL72. Multi-Token Prediction (MTP) improves generation speed by 1.8×, with a token acceptance rate of 80-90%, enhancing inference throughput. FP8 mixed-precision training enables faster computation with less than 0.25% accuracy degradation, validated through extensive small-scale ablations. Capable of running on a $10,000 server equipped with a consumer-grade GPU, delivering nearly 20 TPS, making high-performance LLMs more accessible. In conclusion, the research presents a well-rounded framework for building powerful and resource-conscious large-scale language models. By directly addressing fundamental constraints, such as memory limitations, high computational costs, and inference latency, the researchers demonstrate that intelligent architecture-hardware co-design can unlock high performance without relying on vast infrastructure. DeepSeek-V3 is a clear example of how efficiency and scalability coexist, enabling broader adoption of cutting-edge AI capabilities across diverse organizations. This approach shifts the narrative from scaling through brute force to scaling through smarter engineering. Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit. The post This AI paper from DeepSeek-AI Explores How DeepSeek-V3 Delivers High-Performance Language Modeling by Minimizing Hardware Overhead and Maximizing Computational Efficiency appeared first on MarkTechPost.

This AI paper from DeepSeek-AI Explores How DeepSeek-V3 Delivers High-Performance Language Modeling by Minimizing Hardware Overhead and Maximizing Computational Efficiency Leggi l'articolo »

it_IT