
Inside the story that enraged OpenAI

In 2019, Karen Hao, a senior reporter with MIT Technology Review, pitched me on writing a story about a then little-known company, OpenAI. It was her biggest assignment to date. Hao’s feat of reporting took a series of twists and turns over the coming months, eventually revealing how OpenAI’s ambition had taken it far afield from its original mission. The finished story was a prescient look at a company at a tipping point—or already past it. And OpenAI was not happy with the result. Hao’s new book, Empire of AI: Dreams and Nightmares in Sam Altman’s OpenAI, is an in-depth exploration of the company that kick-started the AI arms race, and what that race means for all of us. This excerpt is the origin story of that reporting. — Niall Firth, executive editor, MIT Technology Review

I arrived at OpenAI’s offices on August 7, 2019. Greg Brockman, then thirty-one, OpenAI’s chief technology officer and soon-to-be company president, came down the staircase to greet me. He shook my hand with a tentative smile. “We’ve never given someone so much access before,” he said.

At the time, few people beyond the insular world of AI research knew about OpenAI. But as a reporter at MIT Technology Review covering the ever-expanding boundaries of artificial intelligence, I had been following its movements closely.

Until that year, OpenAI had been something of a stepchild in AI research. It had an outlandish premise that AGI could be attained within a decade, when most non-OpenAI experts doubted it could be attained at all. To much of the field, it had an obscene amount of funding despite little direction and spent too much of the money on marketing what other researchers frequently snubbed as unoriginal research. It was, for some, also an object of envy. As a nonprofit, it had said that it had no intention to chase commercialization. It was a rare intellectual playground without strings attached, a haven for fringe ideas.

But in the six months leading up to my visit, the rapid slew of changes at OpenAI signaled a major shift in its trajectory. First was its confusing decision to withhold GPT-2 and brag about it. Then its announcement that Sam Altman, who had mysteriously departed his influential perch at YC, would step in as OpenAI’s CEO with the creation of its new “capped-profit” structure. I had already made my arrangements to visit the office when it subsequently revealed its deal with Microsoft, which gave the tech giant priority for commercializing OpenAI’s technologies and locked it into exclusively using Azure, Microsoft’s cloud-computing platform.

Each new announcement garnered fresh controversy, intense speculation, and growing attention, beginning to reach beyond the confines of the tech industry. As my colleagues and I covered the company’s progression, it was hard to grasp the full weight of what was happening. What was clear was that OpenAI was beginning to exert meaningful sway over AI research and the way policymakers were learning to understand the technology. The lab’s decision to revamp itself into a partially for-profit business would have ripple effects across its spheres of influence in industry and government.

So late one night, with the urging of my editor, I dashed off an email to Jack Clark, OpenAI’s policy director, whom I had spoken with before: I would be in town for two weeks, and it felt like the right moment in OpenAI’s history. Could I interest them in a profile? Clark passed me on to the communications head, who came back with an answer.
OpenAI was indeed ready to reintroduce itself to the public. I would have three days to interview leadership and embed inside the company.

Brockman and I settled into a glass meeting room with the company’s chief scientist, Ilya Sutskever. Sitting side by side at a long conference table, they each played their part. Brockman, the coder and doer, leaned forward, a little on edge, ready to make a good impression; Sutskever, the researcher and philosopher, settled back into his chair, relaxed and aloof.

I opened my laptop and scrolled through my questions. OpenAI’s mission is to ensure beneficial AGI, I began. Why spend billions of dollars on this problem and not something else?

Brockman nodded vigorously. He was used to defending OpenAI’s position. “The reason that we care so much about AGI and that we think it’s important to build is because we think it can help solve complex problems that are just out of reach of humans,” he said.

He offered two examples that had become dogma among AGI believers. Climate change. “It’s a super-complex problem. How are you even supposed to solve it?” And medicine. “Look at how important health care is in the US as a political issue these days. How do we actually get better treatment for people at lower cost?”

On the latter, he began to recount the story of a friend who had a rare disorder and had recently gone through the exhausting rigmarole of bouncing between different specialists to figure out his problem. AGI would bring together all of these specialties. People like his friend would no longer spend so much energy and frustration on getting an answer.

Why did we need AGI to do that instead of AI? I asked. This was an important distinction. The term AGI, once relegated to an unpopular section of the technology dictionary, had only recently begun to gain more mainstream usage—in large part because of OpenAI. And as OpenAI defined it, AGI referred to a theoretical pinnacle of AI research: a piece of software that had just as much sophistication, agility, and creativity as the human mind to match or exceed its performance on most (economically valuable) tasks. The operative word was theoretical.

Since the beginning of earnest research into AI several decades earlier, debates had raged about whether silicon chips encoding everything in their binary ones and zeros could ever simulate brains and the other biological processes that give rise to what we consider intelligence. There had yet to


Can crowdsourced fact-checking curb misinformation on social media?

In a 2019 speech at Georgetown University, Mark Zuckerberg famously declared that he didn’t want Facebook to be an “arbiter of truth.” And yet, in the years since, his company, Meta, has used several methods to moderate content and identify misleading posts across its social media apps, which include Facebook, Instagram, and Threads. These methods have included automatic filters that identify illegal and malicious content, and third-party fact-checkers who manually research the validity of claims made in certain posts.

Zuckerberg explained that while Meta has put a lot of effort into building “complex systems to moderate content,” over the years these systems have made many mistakes, with the result being “too much censorship.” The company therefore announced that it would be ending its third-party fact-checking program in the US, replacing it with a system called Community Notes, which relies on users to flag false or misleading content and provide context about it.

While Community Notes has the potential to be extremely effective, the difficult job of content moderation benefits from a mix of different approaches. As a professor of natural language processing at MBZUAI, I’ve spent most of my career researching disinformation, propaganda, and fake news online. So one of the first questions I asked myself was: will replacing human fact-checkers with crowdsourced Community Notes have negative impacts on users?

Wisdom of crowds

Community Notes got its start on Twitter as Birdwatch. It’s a crowdsourced feature where users who participate in the program can add context and clarification to what they deem false or misleading tweets. The notes are hidden until community evaluation reaches a consensus—meaning people who hold different perspectives and political views agree that a post is misleading. An algorithm determines when the threshold for consensus is reached, and then the note becomes publicly visible beneath the tweet in question, providing additional context to help users make informed judgments about its content.

Community Notes seems to work rather well. A team of researchers from the University of Illinois Urbana-Champaign and the University of Rochester found that X’s Community Notes program can reduce the spread of misinformation, leading to post retractions by authors. Facebook is largely adopting the same approach that is used on X today.

Having studied and written about content moderation for years, I’m glad to see another major social media company implementing crowdsourcing for content moderation. If it works for Meta, it could be a true game-changer for the more than 3 billion people who use the company’s products every day.

That said, content moderation is a complex problem. There is no single silver bullet that will work in all situations. The challenge can only be addressed by employing a variety of tools that include human fact-checkers, crowdsourcing, and algorithmic filtering. Each of these is best suited to different kinds of content, and they can and must work in concert.

Spam and LLM safety

There are precedents for addressing similar problems. Decades ago, spam email was a much bigger problem than it is today. In large part, we’ve defeated spam through crowdsourcing. Email providers introduced reporting features, where users can flag suspicious emails. The more widely distributed a particular spam message is, the more likely it will be caught, as it’s reported by more people.

Another useful comparison is how large language models (LLMs) approach harmful content.
For the most dangerous queries—related to weapons or violence, for example—many LLMs simply refuse to answer. Other times, these systems may add a disclaimer to their outputs, such as when they are asked to provide medical, legal, or financial advice. This tiered approach is one that my colleagues and I at MBZUAI explored in a recent study, where we propose a hierarchy of ways LLMs can respond to different kinds of potentially harmful queries.

Similarly, social media platforms can benefit from different approaches to content moderation. Automatic filters can be used to identify the most dangerous information, preventing users from seeing and sharing it. These automated systems are fast, but they can only be used for certain kinds of content because they aren’t capable of the nuance required for most content moderation.

Crowdsourced approaches like Community Notes can flag potentially harmful content by relying on the knowledge of users. They are slower than automated systems but faster than professional fact-checkers.

Professional fact-checkers take the most time to do their work, but the analyses they provide are deeper than Community Notes, which are limited to 500 characters. Fact-checkers typically work as a team and benefit from shared knowledge. They are often trained to analyze the logical structure of arguments, identifying rhetorical techniques frequently employed in mis- and disinformation campaigns. But the work of professional fact-checkers can’t scale in the same way Community Notes can.

That’s why these three methods are most effective when they are used together. Indeed, Community Notes have been found to amplify the work done by fact-checkers so it reaches more users. One study found that Community Notes and fact-checking complement each other, as they focus on different types of accounts, with Community Notes tending to analyze posts from large accounts that have high “social influence.” When Community Notes and fact-checkers do converge on the same posts, however, their assessments are similar. Another study found that crowdsourced content moderation itself benefits from the findings of professional fact-checkers.

A path forward

At its heart, content moderation is extremely difficult because it is about how we determine truth—and there is much we don’t know. Even scientific consensus, built over years by entire disciplines, can change over time. That said, platforms shouldn’t retreat from the difficult task of moderating content altogether—or become overly dependent on any single solution. They must continuously experiment, learn from their failures, and refine their strategies. As it’s been said, the difference between people who succeed and people who fail is that successful people have failed more times than others have even tried.

This content was produced by the Mohamed bin Zayed University of Artificial Intelligence. It was not written by MIT Technology Review’s editorial staff.


LLMs Struggle with Real Conversations: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop in Multi-Turn Underspecified Tasks

Conversational artificial intelligence is centered on enabling large language models (LLMs) to engage in dynamic interactions where user needs are revealed progressively. These systems are widely deployed in tools that assist with coding, writing, and research by interpreting and responding to natural language instructions. The aspiration is for these models to flexibly adjust to changing user inputs over multiple turns, adapting their understanding with each new piece of information. This contrasts with static, single-turn responses and highlights a major design goal: sustaining contextual coherence and delivering accurate outcomes in extended dialogues.

A persistent problem in conversational AI is the model’s inability to handle user instructions distributed across multiple conversation turns. Rather than receiving all necessary information at once, LLMs must extract and integrate key details incrementally. However, when the task is not specified upfront, models tend to make early assumptions about what is being asked and attempt final solutions prematurely. This leads to errors that persist through the conversation, as the models often stick to their earlier interpretations. Once an LLM makes a misstep in understanding, it struggles to recover, producing incomplete or misguided answers.

Most current tools evaluate LLMs using single-turn, fully specified prompts, where all task requirements are presented in one go. Even in research claiming multi-turn analysis, the conversations are typically episodic, treated as isolated subtasks rather than an evolving flow. These evaluations fail to account for how models behave when information is fragmented and context must be actively constructed across multiple exchanges. Consequently, they often miss the core difficulty models face: integrating underspecified inputs over several conversational turns without explicit direction.

Researchers from Microsoft Research and Salesforce Research introduced a simulation setup that mimics how users reveal information in real conversations. Their “sharded simulation” method takes complete instructions from high-quality benchmarks and splits them into smaller, logically connected parts, or “shards.” Each shard delivers a single element of the original instruction, which is then revealed sequentially over multiple turns. This simulates the progressive disclosure of information that happens in practice. The setup includes a simulated user, powered by an LLM, that decides which shard to reveal next and reformulates it naturally to fit the ongoing context. It also uses classification mechanisms to evaluate whether the assistant’s responses attempt a solution or request clarification, further refining the simulation of genuine interaction.

The framework simulates five types of conversations, including single-turn full instructions and several multi-turn setups. In sharded simulations, LLMs received instructions one shard at a time, forcing them to build up the task before proposing a complete answer. This setup was used to evaluate 15 LLMs across six generation tasks: coding, SQL queries, API actions, math problems, data-to-text descriptions, and document summaries. Each task drew from established datasets such as GSM8K, Spider, and ToTTo. For every LLM and instruction, 10 simulations were conducted, totaling over 200,000 simulations. Aptitude, unreliability, and average performance were computed using a percentile-based scoring system, allowing direct comparison of best- and worst-case outcomes per model.
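
To make the setup concrete, here is a minimal sketch of such a sharded evaluation loop. It is not the authors' implementation: the shard list, the `fake_assistant` stand-in, and the heuristic for spotting a premature answer attempt are illustrative assumptions, and the percentile helper reflects one plausible reading of the scoring described above.

```python
import statistics

# A fully specified instruction, pre-split into logically connected "shards".
# Real shards come from existing benchmarks; these are illustrative.
SHARDS = [
    "Write a Python function that returns the n-th Fibonacci number.",
    "It should raise ValueError for negative n.",
    "Use an iterative implementation, not recursion.",
]

def fake_assistant(messages):
    """Stand-in for a chat-model call; replace with a real LLM client."""
    return "def fib(n): ..."  # a (possibly premature) full-answer attempt

def attempts_answer(reply: str) -> bool:
    """Crude proxy for the paper's answer-attempt classifier."""
    return "def " in reply

def run_sharded_episode(shards, assistant=fake_assistant):
    """Reveal one shard per user turn and note when the model first commits to an answer."""
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    first_attempt_turn = None
    for turn, shard in enumerate(shards, start=1):
        messages.append({"role": "user", "content": shard})
        reply = assistant(messages)
        messages.append({"role": "assistant", "content": reply})
        if first_attempt_turn is None and attempts_answer(reply):
            # Answering before all shards have arrived is where lock-in errors start.
            first_attempt_turn = turn
    return messages, first_attempt_turn

def aptitude_and_unreliability(scores, hi=90, lo=10):
    """One reading of the percentile metrics: best-case level and best/worst-case gap."""
    cuts = statistics.quantiles(scores, n=100)
    p_hi, p_lo = cuts[hi - 1], cuts[lo - 1]
    return p_hi, p_hi - p_lo

if __name__ == "__main__":
    _, first_turn = run_sharded_episode(SHARDS)
    print("first answer attempt at turn:", first_turn)
    print(aptitude_and_unreliability([1, 0, 1, 1, 0, 1, 0, 1, 1, 0]))
```

In a real run, `fake_assistant` would be replaced by an API call, the answer-attempt check by the paper's classifier, and the binary scores by per-task evaluation results.
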
Across all tasks and models, a consistent decline in performance was observed in the sharded setting. On average, performance dropped from 90% in single-turn to 65% in multi-turn scenarios—a 25-point decline. The main cause was not reduced capability but a dramatic rise in unreliability. While aptitude dropped by 16%, unreliability increased by 112%, revealing that models varied wildly in how they performed when information was presented gradually. For example, even top-performing models like GPT-4.1 and Gemini 2.5 Pro exhibited 30-40% average degradations. Additional compute at generation time or lowered randomness (temperature settings) offered only minor improvements in consistency.

This research clarifies that even state-of-the-art LLMs are not yet equipped to manage complex conversations where task requirements unfold gradually. The sharded simulation methodology effectively exposes how models falter in adapting to evolving instructions, highlighting the urgent need to improve reliability in multi-turn settings. Enhancing the ability of LLMs to process incomplete instructions over time is essential for real-world applications where conversations are naturally unstructured and incremental.

Check out the Paper and GitHub Page.


This AI paper from DeepSeek-AI Explores How DeepSeek-V3 Delivers High-Performance Language Modeling by Minimizing Hardware Overhead and Maximizing Computational Efficiency

The growth in developing and deploying large language models (LLMs) is closely tied to architectural innovations, large-scale datasets, and hardware improvements. Models like DeepSeek-V3, GPT-4o, Claude 3.5 Sonnet, and LLaMA-3 have demonstrated how scaling enhances reasoning and dialogue capabilities. However, as their performance increases, so do computing, memory, and communication bandwidth demands, placing substantial strain on hardware. Without parallel progress in model and infrastructure co-design, these models risk becoming accessible only to organizations with massive resources. This makes optimizing training cost, inference speed, and memory efficiency a critical area of research.

A core challenge is the mismatch between model size and hardware capabilities. LLM memory consumption grows by over 1,000% annually, while high-speed memory bandwidth increases by less than 50%. During inference, caching prior context in Key-Value (KV) stores adds to memory strain and slows processing. Dense models activate all parameters per token, escalating computational costs, particularly for models with hundreds of billions of parameters. This results in billions of floating-point operations per token and high energy demands. Time Per Output Token (TPOT), a key performance metric, also suffers, impacting user experience. These problems call for solutions beyond simply adding more hardware.

Techniques like Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce memory usage by sharing attention weights. Windowed KV caching lowers memory use by storing only recent tokens, but can limit long-context understanding. Quantized compression with low-bit formats like 4-bit and 8-bit cuts memory further, though sometimes with trade-offs in accuracy. Precision formats such as BF16 and FP8 improve training speed and efficiency. While useful, these techniques often tackle individual issues rather than offering a comprehensive solution to scaling challenges.

Researchers from DeepSeek-AI introduced a more integrated and efficient strategy with the development of DeepSeek-V3, designed to scale intelligently rather than excessively. Utilizing 2,048 NVIDIA H800 GPUs, the model achieves state-of-the-art performance while focusing on cost-efficiency. Instead of depending on expansive infrastructure, the team engineered the model architecture to work harmoniously with hardware constraints. Central to this effort are innovations such as Multi-head Latent Attention (MLA) for memory optimization, a Mixture of Experts (MoE) framework for computational efficiency, and FP8 mixed-precision training to accelerate performance without sacrificing accuracy. A custom Multi-Plane Network Topology was also employed to minimize inter-device communication overhead. Collectively, these components make DeepSeek-V3 a scalable and accessible solution, capable of rivaling much larger systems while operating on significantly leaner resources.

The architecture achieves memory efficiency by reducing the KV cache requirement per token to just 70 KB using MLA, compared to 327 KB and 516 KB in Qwen-2.5 and LLaMA-3.1, respectively. This reduction is accomplished by compressing attention heads into a smaller latent vector jointly trained with the model. Computational efficiency is further boosted with the MoE design, which increases total parameters to 671 billion but activates only 37 billion per token. This contrasts sharply with dense models that require full parameter activation: LLaMA-3.1 needs 2,448 GFLOPS per token, while DeepSeek-V3 operates at just 250 GFLOPS.
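
To make these per-token figures concrete, the back-of-envelope sketch below multiplies them out over a long context and a small batch. The KV-cache sizes and GFLOPS numbers are the ones quoted in the article; the context length and batch size are arbitrary assumptions chosen purely for illustration.

```python
# Back-of-envelope memory math using the per-token KV-cache figures quoted above.
# Context length and batch size are illustrative assumptions, not paper settings.
KV_BYTES_PER_TOKEN = {
    "DeepSeek-V3 (MLA)": 70 * 1024,   # 70 KB per token
    "Qwen-2.5": 327 * 1024,           # 327 KB per token
    "LLaMA-3.1": 516 * 1024,          # 516 KB per token
}

CONTEXT_TOKENS = 32_768  # assumed context length
BATCH_SIZE = 8           # assumed number of concurrent sequences

for model, per_token in KV_BYTES_PER_TOKEN.items():
    total_gib = per_token * CONTEXT_TOKENS * BATCH_SIZE / 1024**3
    print(f"{model:20s} KV cache: {total_gib:6.1f} GiB "
          f"({CONTEXT_TOKENS} tokens x {BATCH_SIZE} sequences)")

# Compute per token scales with *activated* parameters, which is why the MoE
# design (37B of 671B parameters active) lands near 250 GFLOPS per token,
# versus 2,448 GFLOPS for dense LLaMA-3.1.
```

At an assumed 32K-token context and a batch of eight sequences, the quoted 70 KB versus 516 KB per token works out to roughly 17.5 GiB versus 129 GiB of KV cache, which is the gap that makes long-context serving feasible on leaner hardware.
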
The architecture also integrates a Multi-Token Prediction (MTP) module, enabling the generation of multiple tokens in a single step. The system achieves up to a 1.8x improvement in generation speed, and real-world measurements show 80-90% token acceptance for speculative decoding.

Using a system interconnected by CX7 400 Gbps InfiniBand NICs, DeepSeek-V3 achieves a theoretical TPOT of 14.76 milliseconds, equal to 67 tokens per second. With higher-bandwidth setups like NVIDIA GB200 NVL72 offering 900 GB/s, this number can be reduced to a TPOT of 0.82 milliseconds, potentially achieving 1,200 tokens per second. Practical throughput is lower due to compute-communication overlap and memory limitations, but the framework lays the foundation for future high-speed implementations.

FP8 precision further adds to the speed gains. The training framework applies tile-wise 1×128 and block-wise 128×128 quantization, with less than 0.25% accuracy loss compared to BF16. These results were validated on smaller 16B and 230B parameter versions before integration into the 671B model.

Several key takeaways from the research on DeepSeek-V3 include:

- MLA compression reduces KV cache size per token from 516 KB to 70 KB, significantly lowering memory demands during inference.
- Only 37 billion of the 671 billion total parameters are activated per token, dramatically reducing compute and memory requirements without compromising model performance.
- DeepSeek-V3 requires just 250 GFLOPS per token, compared to 2,448 GFLOPS for dense models like LLaMA-3.1, highlighting its computational efficiency.
- The system achieves up to 67 tokens per second (TPS) on a 400 Gbps InfiniBand network, with the potential to scale to 1,200 TPS using advanced interconnects like NVL72.
- Multi-Token Prediction (MTP) improves generation speed by 1.8x, with a token acceptance rate of 80-90%, enhancing inference throughput.
- FP8 mixed-precision training enables faster computation with less than 0.25% accuracy degradation, validated through extensive small-scale ablations.
- The model is capable of running on a $10,000 server equipped with a consumer-grade GPU, delivering nearly 20 TPS, making high-performance LLMs more accessible.

In conclusion, the research presents a well-rounded framework for building powerful and resource-conscious large-scale language models. By directly addressing fundamental constraints such as memory limitations, high computational costs, and inference latency, the researchers demonstrate that intelligent architecture-hardware co-design can unlock high performance without relying on vast infrastructure. DeepSeek-V3 is a clear example of how efficiency and scalability can coexist, enabling broader adoption of cutting-edge AI capabilities across diverse organizations. This approach shifts the narrative from scaling through brute force to scaling through smarter engineering.

Check out the Paper.


Google Researchers Introduce LightLab: A Diffusion-Based AI Method for Physically Plausible, Fine-Grained Light Control in Single Images

Manipulating lighting conditions in images post-capture is challenging. Traditional approaches rely on 3D graphics methods that reconstruct scene geometry and properties from multiple captures before simulating new lighting using physical illumination models. Though these techniques provide explicit control over light sources, recovering accurate 3D models from single images remains an ill-posed problem that frequently produces unsatisfactory results. Modern diffusion-based image editing methods have emerged as alternatives that use strong statistical priors to bypass physical modeling requirements. However, these approaches struggle with precise parametric control due to their inherent stochasticity and dependence on textual conditioning.

Generative image editing methods have been adapted for various relighting tasks with mixed results. Portrait relighting approaches often use light-stage data to supervise generative models, while object relighting methods might fine-tune diffusion models on synthetic datasets conditioned on environment maps. Some methods assume a single dominant light source for outdoor scenes, like the sun, while indoor scenes present more complex multi-illumination challenges. Various approaches address these issues, including inverse rendering networks and methods that manipulate StyleGAN’s latent space. Flash photography research shows progress in multi-illumination editing through techniques that use flash/no-flash pairs to disentangle and manipulate scene illuminants.

Researchers from Google, Tel Aviv University, Reichman University, and the Hebrew University of Jerusalem have proposed LightLab, a diffusion-based method enabling explicit parametric control over light sources in images. It targets two fundamental properties of light sources: intensity and color. LightLab also provides control over ambient illumination and tone-mapping effects, creating a comprehensive set of editing tools that allow users to manipulate an image’s overall look and feel through illumination adjustments. The method shows effectiveness on indoor images containing visible light sources, though additional results show promise for outdoor scenes and out-of-domain examples. Comparative analysis confirms that LightLab is pioneering in delivering high-quality, precise control over visible local light sources.

LightLab uses pairs of images to implicitly model controlled light changes in image space, which then train a specialized diffusion model. The data collection combines real photographs with synthetic renderings. The photography dataset consists of 600 raw image pairs captured using mobile devices on tripods, with each pair showing identical scenes where only a visible light source is switched on or off. Auto-exposure settings and post-capture calibration ensure proper exposure. A larger set of synthetic images, rendered from 20 artist-created indoor 3D scenes, augments this collection using physically based rendering in Blender. This synthetic pipeline randomly samples camera views around target objects and procedurally assigns light source parameters, including intensity, color temperature, area size, and cone angle. Comparative analysis shows that using a weighted mixture of real captures and synthetic renders achieves optimal results across all settings. The quantitative improvement from adding synthetic data to real captures is relatively modest, at only 2.2% in PSNR, likely because significant local illumination changes are overshadowed by low-frequency image-wide details in these metrics.
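
The paired captures are valuable because light adds linearly in raw image space: the target light's contribution is simply the difference between the "on" and "off" photos, which can be rescaled and tinted to synthesize targets at new intensities and colors. The snippet below is a minimal sketch of that idea under those assumptions, not LightLab's actual data pipeline; the function name, scaling factors, and the gamma tone map are illustrative.

```python
import numpy as np

def synthesize_relit(img_light_off, img_light_on, intensity=0.5, color=(1.0, 0.9, 0.7)):
    """Compose a new lighting condition from an on/off pair in linear RGB.

    img_light_off, img_light_on: float arrays of shape (H, W, 3), assumed to be
    exposure-calibrated, linear (raw-like) images of the same static scene.
    """
    # The target light's contribution is the difference between the two captures.
    light_contribution = np.clip(img_light_on - img_light_off, 0.0, None)
    tint = np.asarray(color, dtype=np.float32).reshape(1, 1, 3)
    # Scale and tint that contribution, then add it back onto the ambient image.
    relit_linear = img_light_off + intensity * tint * light_contribution
    # Simple gamma tone map for display; any tone curve could be used here.
    return np.clip(relit_linear, 0.0, 1.0) ** (1.0 / 2.2)

if __name__ == "__main__":
    h, w = 4, 4
    ambient = np.full((h, w, 3), 0.05, dtype=np.float32)        # light off
    lit = ambient + np.full((h, w, 3), 0.30, dtype=np.float32)  # light on
    out = synthesize_relit(ambient, lit, intensity=0.25, color=(1.0, 0.6, 0.3))
    print(out.shape, float(out.mean()))
```

Variations produced this way, together with the synthetic renders, are the kind of paired supervision the article describes for training the specialized diffusion model.
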
Qualitative comparisons on evaluation datasets show LightLab’s superiority over competing methods like OmniGen, RGB X, ScribbleLight, and IC-Light. These alternatives often introduce unwanted illumination changes, color distortion, or geometric inconsistencies. In contrast, LightLab provides faithful control over target light sources while generating physically plausible lighting effects throughout the scene.

In conclusion, the researchers introduced LightLab, an advancement in diffusion-based light source manipulation for images. Using light linearity principles and synthetic 3D data, they created high-quality paired images that implicitly model complex illumination changes. Despite its strengths, LightLab faces limitations from dataset bias, particularly regarding light source types, which could be addressed through integration with unpaired fine-tuning methods. Moreover, while the simple data capture process, using consumer mobile devices with post-capture exposure calibration, facilitated easier dataset collection, it prevents precise relighting in absolute physical units, indicating room for further refinement in future iterations.

Check out the Paper and Project Page.


SWE-Bench Performance Reaches 50.8% Without Tool Use: A Case for Monolithic State-in-Context Agents

Recent advancements in LM agents have shown promising potential for automating intricate real-world tasks. These agents typically operate by proposing and executing actions through APIs, supporting applications such as software engineering, robotics, and scientific experimentation. As these tasks become more complex, LM agent frameworks have evolved to include multiple agents, multi-step retrieval, and tailored scaffolding to optimize performance. A central challenge lies in effectively exploring and understanding the environment, which has prompted the development of engineered scaffolds using tools, memory mechanisms, and custom pipelines. However, most existing methods assume partial observability, requiring agents to collect observations incrementally. While this assumption holds in dynamic or unfamiliar environments, it is less applicable in fully observable settings like SWE-bench, where all relevant information is accessible from the start.

In software engineering, research on LM agents has focused on two main strategies: agent-based frameworks and structured pipelines. Agent-based systems, such as SWE-Agent and OpenHands CodeAct, allow LMs to interact autonomously with codebases, often through custom interfaces and retrieval tools. Other models like Moatless and AutoCodeRover enhance localization through search techniques, while SpecRover refines scaffolding design. Alternatively, structured pipelines—such as Agentless and CodeMonkey—decompose tasks into sequential phases like localization, repair, and validation. While these approaches depend on engineered components for performance, the current study proposes leveraging Long-Context LMs (LCLMs) to directly interpret the entire task environment. Advances in LCLM architecture and infrastructure now allow these models to outperform retrieval-augmented systems in many contexts, reducing reliance on complex external scaffolding.

Researchers from Stanford, IBM, and the University of Toronto explored whether complex scaffolding is necessary for LM agents tackling tasks like SWE-bench. They show that simply using LCLMs such as Gemini-1.5-Pro, with proper prompting and no scaffolding, can achieve competitive performance, reaching 38% on SWE-bench Verified. Gemini-2.5-Pro, using the same simple setup, reaches 50.8%. Their work suggests that many complex agentic designs could be replaced with a single powerful LCLM, simplifying architecture and training. Additionally, a hybrid two-stage approach using Gemini-1.5-Pro and Claude-3.7 achieves a 48.6% solve rate, further supporting this simplified direction.

Traditional LM agents rely on interactive exploration due to partial observability, but many tasks, like software debugging, allow full observability. The study proposes state-in-context agents that leverage LCLMs to directly process full or compressed environment states, bypassing the need for complex agentic scaffolding. For large codebases, a ranking-based compression selects relevant files to fit within context limits. Two methods are introduced: DIRECTSOLVE, where LCLMs solve tasks using the full context, and SELECTSOLVE, where LCLMs localize relevant files for short-context LMs (SCLMs) to solve. Both use targeted patch formats and validation to ensure accuracy and reduce hallucination.
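
The sketch below shows the general shape of such a state-in-context approach: rank files with a relevance heuristic, pack as many as fit into the model's context window, and ask the LCLM for a patch in a single shot. It is a hedged illustration rather than the paper's code; the `score_file` heuristic, the characters-per-token estimate, and the `call_lclm` stub are assumptions.

```python
from pathlib import Path

MAX_CONTEXT_TOKENS = 1_000_000   # assumed LCLM context budget
CHARS_PER_TOKEN = 4              # rough packing estimate

def score_file(path: Path, issue_text: str) -> int:
    """Toy relevance heuristic: count issue keywords that appear in the file."""
    try:
        text = path.read_text(errors="ignore").lower()
    except OSError:
        return 0
    keywords = {w for w in issue_text.lower().split() if len(w) > 3}
    return sum(text.count(k) for k in keywords)

def pack_state(repo_root: str, issue_text: str) -> str:
    """Rank files by relevance and concatenate them until the token budget is spent."""
    files = sorted(Path(repo_root).rglob("*.py"),
                   key=lambda p: score_file(p, issue_text), reverse=True)
    budget_chars = MAX_CONTEXT_TOKENS * CHARS_PER_TOKEN
    chunks = []
    for path in files:
        try:
            body = path.read_text(errors="ignore")
        except OSError:
            continue
        piece = f"### {path}\n{body}\n"
        if len(piece) > budget_chars:
            continue  # skip files that no longer fit in the remaining budget
        chunks.append(piece)
        budget_chars -= len(piece)
    return "".join(chunks)

def call_lclm(prompt: str) -> str:
    """Stand-in for a long-context model call; replace with a real API client."""
    return "diff --git a/placeholder b/placeholder\n"

def direct_solve(repo_root: str, issue_text: str) -> str:
    """One-shot, tool-free solving: put the (compressed) repository state in context."""
    prompt = (
        "You are given a repository and an issue. Think step by step, restate the "
        "relevant code, then output a unified diff that fixes the issue.\n\n"
        f"ISSUE:\n{issue_text}\n\nREPOSITORY:\n{pack_state(repo_root, issue_text)}\n"
    )
    return call_lclm(prompt)
```

The prompt's instruction to think step by step and restate the relevant code mirrors the ablations discussed next, where chain-of-thought prompting and code restatement matter for performance.
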
The experiments evaluate this simplified agent framework on the SWE-bench Verified benchmark, which includes 500 real-world software engineering tasks. The proposed methods, DIRECTSOLVE and SELECTSOLVE, use LCLMs like Gemini-1.5-Pro and Gemini-2.5-Pro, and, in SELECTSOLVE, an additional SCLM (Claude-3.7-Sonnet) for patch generation. Results show that DIRECTSOLVE outperforms complex agentic approaches like Agentless and CodeAct with minimal engineering. SELECTSOLVE further improves accuracy by leveraging stronger models for patching. Ablation studies highlight the importance of chain-of-thought (CoT) prompting, code restatement, and token-efficient context design. Additionally, positioning relevant files at the start of the prompt improves performance, underscoring limitations in long-context processing.

In terms of cost, LCLM-based methods are currently more expensive than existing approaches like Agentless and CodeAct, averaging $2.60 per instance compared to $0.25 and $0.87, respectively. However, rapid drops in inference costs and increasing context lengths make LCLMs more practical. Techniques like KV caching significantly lower costs after initial runs, reducing them to about $0.725 per instance. Although slight codebase changes still limit caching benefits, further improvements could help. The study also suggests that LCLMs can handle long interaction histories, reducing the need for complex memory and retrieval mechanisms. Notably, unscaffolded LCLMs can perform competitively on SWE-bench tasks.

Check out the Paper.


AI Agents Now Write Code in Parallel: OpenAI Introduces Codex, a Cloud-Based Coding Agent Inside ChatGPT

OpenAI has introduced Codex, a cloud-native software engineering agent integrated into ChatGPT, signaling a new era in AI-assisted software development. Unlike traditional coding assistants, Codex is not just a tool for autocompletion—it acts as a cloud-based agent capable of autonomously performing a wide range of programming tasks, from writing and debugging code to running tests and generating pull requests.

A Shift Toward Parallel, Agent-Driven Development

At the core of Codex is codex-1, a fine-tuned version of OpenAI’s reasoning model, optimized specifically for software engineering workflows. Codex can handle multiple tasks simultaneously, operating inside isolated cloud sandboxes that are preloaded with the user’s codebase. Each request is handled in its own environment, allowing users to delegate different coding operations in parallel without disrupting their local development environment.

This architecture introduces a fundamentally new approach to software engineering—developers now interact with an agent that behaves more like a collaborative teammate than a static code tool. You can ask Codex to “fix a bug,” “add logging,” or “refactor this module,” and it will return a verifiable response, including diffs, terminal logs, and test results. If the output looks good, you can copy the patch directly into your repository—or ask for revisions.

Embedded Within ChatGPT, Accessible to Teams

Codex lives in the ChatGPT interface, currently available to Pro, Team, and Enterprise users, with broader access expected soon. The interface includes a dedicated sidebar where developers can describe what they want in natural language. Codex then interprets the intent and handles the coding behind the scenes, surfacing results for review and feedback.

This integration offers a significant boost to developer productivity. As OpenAI notes, Codex is designed to take on many of the repetitive or boilerplate-heavy aspects of coding—allowing developers to focus on architecture, design, and higher-order problem solving. In one case, an OpenAI staffer even “checked in two bug fixes written entirely by Codex,” all while working on unrelated tasks.

Codex Understands Your Codebase

What makes Codex more than just a smart code generator is its context-awareness. Each instance runs with full access to your project’s file structure, coding conventions, and style. This allows it to write code that aligns with your team’s standards—whether you’re using Flask or FastAPI, React or Vue, or a custom internal framework.

Codex’s ability to adapt to a codebase makes it particularly useful for large-scale enterprise teams and open-source maintainers. It supports workflows like branch-based pull request generation, test suite execution, and static analysis—all initiated by simple English prompts. Over time, it learns the nuances of the repository it works in, leading to better suggestions and more accurate code synthesis.

Broader Implications: Lowering the Barrier to Software Creation

OpenAI frames Codex as a research preview, but its long-term vision is clear: AI will increasingly take over much of the routine work involved in building software. The aim isn’t to replace developers but to democratize software creation, allowing more people—especially non-traditional developers—to build working applications using natural language alone.

In this light, Codex is not just a coding tool, but a stepping stone toward a world where software development is collaborative between humans and machines.
It brings software creation closer to the realm of design and ideation, and further away from syntax and implementation details.

What’s Next?

Codex is rolling out gradually, with usage limits in place during the preview phase. OpenAI is gathering feedback to refine the agent’s capabilities, improve safety, and optimize its performance across different environments and languages. Whether you’re a solo developer, part of a DevOps team, or leading an enterprise platform, Codex represents a significant shift in how code is written, tested, and shipped. As AI agents continue to mature, the future of software engineering will be less about writing every line yourself—and more about knowing what to build, and asking the right questions.

Check out the Details here.
