{"id":90574,"date":"2026-05-15T16:34:00","date_gmt":"2026-05-15T16:34:00","guid":{"rendered":"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/"},"modified":"2026-05-15T16:34:00","modified_gmt":"2026-05-15T16:34:00","slug":"best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field","status":"publish","type":"post","link":"https:\/\/youzum.net\/zh\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/","title":{"rendered":"Best AI Agents for Software Development Ranked: A Benchmark-Driven Look at the Current Field"},"content":{"rendered":"<p>The AI coding agent market looks almost unrecognizable compared to 2024 or even early 2025. What started as inline autocomplete has evolved into fully autonomous systems that read GitHub issues, navigate multi-file codebases, write fixes, execute tests, and open pull requests \u2014 without a human typing a single line of code. By early 2026, roughly 85% of developers reported regularly using some form of AI assistance for coding. The category has fractured into distinct archetypes: terminal agents, AI-native IDEs, cloud-hosted autonomous engineers, and open-source frameworks that let you swap in whatever model you prefer.<\/p>\n<p>The problem is that every tool claims to be the best, and the benchmarks used to justify those claims are not always measuring the same things \u2014 and in some cases are no longer credible measures at all. This article ranks the most important AI coding agents by the metrics that actually matter for production software development, while being honest about where those metrics have broken down. 
If you are an AI\/ML engineer, software developer, or data scientist trying to decide where to invest your tooling budget in 2026, start here.<\/p>\n<h2 class=\"wp-block-heading\"><strong>How to Read These Benchmarks \u2014 Including Why the Most-Cited One Is Now Disputed<\/strong><\/h2>\n<p>Before the listing, an important calibration on the numbers \u2014 because one major benchmark shift happened mid-cycle and is not yet reflected in most tool comparison articles.<\/p>\n<p><strong>SWE-bench Verified<\/strong> has been the industry\u2019s standard coding benchmark since mid-2024. It presents agents with 500 real GitHub issues drawn from popular Python repositories and measures whether the agent can understand the problem, navigate the codebase, generate a fix, and verify that it passes tests \u2014 end-to-end, without human guidance. It was a credible proxy. In February 2026, that changed.<\/p>\n<p>On February 23, 2026, OpenAI\u2019s Frontier Evals team <a href=\"https:\/\/openai.com\/index\/why-we-no-longer-evaluate-swe-bench-verified\/\">published a detailed post<\/a> explaining why it had stopped reporting SWE-bench Verified scores. Their auditors reviewed 138 of the hardest problems across 64 independent runs and found that 59.4% had fundamentally flawed or unsolvable test cases \u2014 tests that demanded exact function names not mentioned in the problem statement, or checked unrelated behavior pulled from upstream pull requests. More critically, they found evidence that every major frontier model \u2014 GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash \u2014 could reproduce the gold-patch solutions verbatim from memory using only the task ID, confirming systematic training data contamination. 
OpenAI\u2019s conclusion: \u201cImprovements on SWE-bench Verified no longer reflect meaningful improvements in models\u2019 real-world software development abilities.\u201d OpenAI now recommends SWE-bench Pro as the replacement for frontier coding evaluation.<\/p>\n<p>This does not make SWE-bench Verified scores useless. Other major labs continue to report them, third-party evaluators continue to run them, and they remain useful for broad directional comparison. But any ranking that presents SWE-bench Verified scores as clean, objective measurements of real-world ability \u2014 without this caveat \u2014 is giving you an incomplete picture. All scores in this article are flagged accordingly.<\/p>\n<p><strong>SWE-bench Pro<\/strong> is harder to interpret than Verified because published results vary significantly by split, scaffold, harness, and reporting source. The benchmark contains 1,865 total tasks divided into a 731-task public set, an 858-task held-out set, and a 276-task commercial\/private set drawn from 18 proprietary startup codebases. When the <a href=\"https:\/\/arxiv.org\/html\/2509.16941v1\">original Scale AI paper<\/a> measured frontier models using a unified SWE-Agent scaffold, top scores were below 25% \u2014 GPT-5 at 23.3% \u2014 reflecting a genuinely harder evaluation. However, current <a href=\"https:\/\/labs.scale.com\/leaderboard\/swe_bench_pro_public\">public leaderboard<\/a> and vendor-reported runs now show substantially higher scores under newer models and optimized agent harnesses: OpenAI reports GPT-5.5 at 58.6% on SWE-bench Pro (Public), while Anthropic\u2019s comparison table lists Claude Opus 4.7 at 64.3% and Gemini 3.1 Pro at 54.2%. These numbers should not be directly compared with the original sub-25% SWE-Agent results without noting the scaffold and split differences \u2014 the benchmark has not changed, but the evaluation conditions and model generations have. 
When you see a 60%+ SWE-bench Pro score alongside a sub-25% one, they are measuring the same benchmark under very different conditions, not two separate tests.<\/p>\n<p><strong>Terminal-Bench 2.0<\/strong> evaluates terminal-native workflows: shell scripting, file system operations, environment setup, and DevOps automation. As of April 23, 2026, GPT-5.5 leads at 82.7% on this benchmark \u2014 confirmed in <a href=\"https:\/\/openai.com\/index\/introducing-gpt-5-5\/\">OpenAI\u2019s official release<\/a>. Claude Opus 4.7 scores 69.4% (Anthropic\/<a href=\"https:\/\/aws.amazon.com\/blogs\/aws\/introducing-anthropics-claude-opus-4-7-model-in-amazon-bedrock\/\">AWS-reported<\/a>), and Gemini 3.1 Pro scores 68.5%. An important methodological caveat: different harnesses produce different numbers for the same model. Anthropic\u2019s Opus 4.6 system card showed GPT-5.2-Codex scoring 57.5% on the independent Terminus-2 harness vs 64.7% on OpenAI\u2019s own Codex CLI harness \u2014 a 7-point gap from harness alone. When comparing Terminal-Bench figures across sources, always check which execution environment was used.<\/p>\n<p>One final cross-benchmark caveat: agent scaffolding matters as much as the underlying model. In a February 2026 evaluation of 731 problems, three different agent frameworks running the same Opus 4.5 model scored 17 issues apart \u2014 a 2.3-point gap that changes relative rankings. A benchmark score labeled with a model name reflects the model <em>and<\/em> the specific scaffold wrapped around it, not the model in isolation.<\/p>\n<h2 class=\"wp-block-heading\"><strong>10 AI Agents for Software Development<\/strong><\/h2>\n<h3 class=\"wp-block-heading\"><strong>A Note on Claude Mythos Preview<\/strong><\/h3>\n<p>The current leader on SWE-bench Verified among third-party trackers is <strong>Claude Mythos Preview at 93.9%<\/strong>, announced April 7, 2026 under Anthropic\u2019s <a href=\"https:\/\/www.anthropic.com\/glasswing\">Project Glasswing<\/a>. 
It is not generally available. Access is restricted to a limited set of platform partners; Anthropic has stated it does not plan broad release in the near term, in part due to elevated cybersecurity capability concerns. It sits outside the main comparison below because developers cannot access it through standard channels. Its existence does, however, signal that the practical capability ceiling sits substantially above what any publicly available tool currently delivers.<\/p>\n<h3 class=\"wp-block-heading\"><strong>#1. <a href=\"https:\/\/www.anthropic.com\/product\/claude-code\" target=\"_blank\" rel=\"noreferrer noopener\">Claude Code (Anthropic)<\/a><\/strong><\/h3>\n<p><strong>SWE-bench Verified (self-reported):<\/strong> 87.6% (Opus 4.7) \/ 80.8% (Opus 4.6) <strong>SWE-bench Pro (Anthropic internal variant):<\/strong> 64.3% (Opus 4.7, #1) \/ 53.4% (Opus 4.6) <strong>Terminal-Bench 2.0:<\/strong> 69.4% (Opus 4.7, Anthropic-reported) <strong>CursorBench:<\/strong> 70% (Opus 4.7, Cursor-reported) <strong>Claude Code subscription:<\/strong> $20\u2013$200\/month | <strong>Opus 4.7 API:<\/strong> $5\/$25 per million tokens<\/p>\n<p>Claude Code is Anthropic\u2019s terminal-native coding agent and the leader on code quality metrics across most self-reported and third-party evaluations as of May 2026. It runs from the command line, integrates with VS Code and JetBrains via extension, and is built around <a href=\"https:\/\/www.anthropic.com\/news\/claude-opus-4-7\">Claude Opus 4.7<\/a> \u2014 released April 16, 2026.<\/p>\n<p>Opus 4.7 represents a step-change over its predecessor. SWE-bench Verified jumped from 80.8% to 87.6% \u2014 a nearly 7-point gain. On Anthropic\u2019s internal SWE-bench Pro variant, the model moved from 53.4% to 64.3%, an 11-point gain that puts it ahead of every current publicly available competitor on that harder benchmark. On CursorBench, Cursor\u2019s CEO reported Opus 4.7 at 70%, up from 58% for Opus 4.6. 
Rakuten reported 3\u00d7 more production tasks resolved on their internal SWE-bench variant; CodeRabbit reported over 10% recall improvement on complex PR reviews with stable precision.<\/p>\n<p>Opus 4.7 introduced self-verification behavior: the model writes tests, runs them, and fixes failures before surfacing results, rather than waiting for external feedback. It also introduced multi-agent coordination \u2014 the ability to orchestrate parallel AI workstreams rather than processing tasks sequentially \u2014 which matters for teams running code review, documentation, and data processing simultaneously. The 1 million token context window can support much larger repository contexts than shorter-window tools, though very large monorepos still benefit from indexing, retrieval, or file selection strategies to stay within practical limits.<\/p>\n<p>One important pricing distinction: Claude Code subscription tiers ($20\u2013$200\/month) are what individual developers pay to use Claude Code in the CLI and IDE integrations. The underlying Opus 4.7 API is priced at $5 per million input tokens and $25 per million output tokens \u2014 unchanged from Opus 4.6 \u2014 with a batch API discount of 50% and prompt caching reducing costs further. Teams building custom agents on top of the Anthropic API are not paying the subscription rate.<\/p>\n<p>On Terminal-Bench 2.0, Opus 4.7 scores 69.4% \u2014 strong, but GPT-5.5 has since moved ahead on this specific benchmark at 82.7%. For pure terminal\/DevOps agentic workflows, that gap is worth considering.<\/p>\n<p><strong>Best for:<\/strong> Developers working on complex multi-file engineering tasks, large codebases, or long-horizon refactoring who prioritize output quality over speed.<\/p>\n<h3 class=\"wp-block-heading\"><strong>#2. 
<a href=\"https:\/\/openai.com\/codex\/\" target=\"_blank\" rel=\"noreferrer noopener\">OpenAI Codex (OpenAI)<\/a><\/strong><\/h3>\n<p><strong>Terminal-Bench 2.0 (GPT-5.5):<\/strong> 82.7% \u2014 current #1 <strong>SWE-bench Pro Public (OpenAI-reported, GPT-5.5):<\/strong> 58.6% <strong>SWE-bench Verified (third-party trackers, GPT-5.5):<\/strong> ~88.7% (OpenAI does not self-report) <strong>Pricing:<\/strong> Codex CLI is open-source (model usage requires a ChatGPT plan or API key); GPT-5.5 in Codex available on Plus ($20\/month), Pro ($200\/month), Business, Enterprise, Edu, and Go plans; API: $5\/$30 per million tokens (gpt-5.5)<\/p>\n<p>An important correction to many comparisons of Codex: <strong>the Codex CLI is a local tool that runs on your machine<\/strong>, not a cloud-sandboxed system. The Codex CLI (available on GitHub as <a href=\"https:\/\/github.com\/openai\/codex\"><code>openai\/codex<\/code><\/a>) runs a local agent loop in your terminal, using OpenAI\u2019s API for model inference. The cloud execution surface \u2014 where tasks run in an isolated VM without touching your local environment \u2014 is the Codex web product and IDE integrations, not the CLI. This distinction matters for security, network access, and cost modeling.<\/p>\n<p><a href=\"https:\/\/openai.com\/index\/introducing-gpt-5-5\/\">GPT-5.5 launched April 23, 2026<\/a> and is OpenAI\u2019s most capable coding model to date. On Terminal-Bench 2.0, it scores 82.7% \u2014 the current #1 position across all publicly available models, ahead of Claude Opus 4.7 (69.4%) and Gemini 3.1 Pro (68.5%). OpenAI describes Terminal-Bench as the more representative benchmark for the kind of work Codex actually does: \u201ccomplex command-line workflows requiring planning, iteration, and tool coordination.\u201d On SWE-bench Pro (Public), GPT-5.5 scores 58.6% per OpenAI\u2019s release data, behind Claude Opus 4.7 (64.3%) but ahead of earlier GPT generations. 
Claude Opus 4.7 still leads on code quality for multi-file, long-horizon software engineering; GPT-5.5 leads on terminal-native, DevOps-style agentic execution.<\/p>\n<p>Note on SWE-bench Verified: OpenAI <a href=\"https:\/\/openai.com\/index\/why-we-no-longer-evaluate-swe-bench-verified\/\">stopped self-reporting this metric<\/a> in February 2026 due to contamination concerns. Third-party trackers show GPT-5.5 around 88.7%, but OpenAI\u2019s official position is that this benchmark is no longer a reliable frontier measure. They report SWE-bench Pro instead.<\/p>\n<p>GPT-5.5 is available in ChatGPT (Plus, Pro, Business, Enterprise, Edu) and across Codex (CLI, IDE extensions, and the Codex web product). API access was announced and is rolling out. API pricing: $5\/$30 per million tokens for gpt-5.5, a 2\u00d7 jump from GPT-5.4. More than 85% of OpenAI employees now use Codex weekly \u2014 a signal of internal confidence in the product beyond benchmark numbers.<\/p>\n<p><strong>Best for:<\/strong> Developers focused on terminal-native, DevOps, and pipeline automation workflows where Terminal-Bench performance is the primary signal; also the strongest choice for fire-and-forget execution via the Codex web product.<\/p>\n<h3 class=\"wp-block-heading\"><strong>#3. <a href=\"https:\/\/cursor.com\/\">Cursor<\/a><\/strong><\/h3>\n<p><strong>SWE-bench Verified:<\/strong> ~51.7% (default config; rises substantially with Opus 4.7 backend) <strong>Task completion speed:<\/strong> ~30% faster than GitHub Copilot in head-to-head testing <strong>ARR:<\/strong> $2 billion (February 2026) <strong>Pricing:<\/strong> $20\/month (Pro), $60\/month (Pro+), Enterprise tiers above<\/p>\n<p>Cursor reached $2 billion ARR in February 2026 \u2014 doubling from $1 billion in November 2025 \u2014 and is reportedly in talks to raise approximately $2 billion at a $50 billion-plus valuation, with Thrive Capital and Andreessen Horowitz. 
These figures reflect real developer adoption, not benchmark-driven hype.<\/p>\n<p>Cursor\u2019s SWE-bench figure (~51.7%) represents its default model configuration. Because Cursor is model-agnostic and supports Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro, and Grok, its effective benchmark ceiling scales with the model selected \u2014 a developer running Cursor with Opus 4.7 gets materially different performance from one using a default configuration. The 30% task completion speed advantage over Copilot reflects Cursor\u2019s editor-native architecture, which eliminates context-switching overhead between a terminal agent and a separate IDE.<\/p>\n<p>Cursor is a VS Code fork rebuilt around AI at every layer. Its Plan\/Act mode gives developers a structured workflow: plan, review, then execute. Background Agents (Pro+ tier, $60\/month) run autonomous coding sessions on cloud VMs in parallel, without blocking the main editor. Per-task model selection \u2014 fast model for autocomplete, reasoning-heavy model for complex edits \u2014 gives fine-grained cost control.<\/p>\n<p>Cursor is its own editor, not a plugin. Developers using JetBrains, Neovim, or Xcode cannot use Cursor without switching editors. That constraint is real and limits its enterprise footprint compared to Copilot.<\/p>\n<p><strong>Best for:<\/strong> VS Code-native developers who want the best AI-native IDE experience and are willing to pay for the integrated workflow.<\/p>\n<h3 class=\"wp-block-heading\"><strong>#4. 
<a href=\"https:\/\/geminicli.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Gemini CLI (Google DeepMind)<\/a><\/strong><\/h3>\n<p><strong>SWE-bench Verified (Gemini 3.1 Pro):<\/strong> 80.6% <strong>Terminal-Bench 2.0 (Gemini 3.1 Pro):<\/strong> 68.5% <strong>Context Window:<\/strong> 1 million tokens <strong>Pricing:<\/strong> Free tier via Google AI Studio; Google One AI Premium for higher limits<\/p>\n<p>Gemini CLI is Google DeepMind\u2019s open-source coding agent (<code>npm install -g @google\/gemini-cli<\/code>). Its primary model is <strong><a href=\"https:\/\/deepmind.google\/models\/model-cards\/gemini-3-1-pro\/\">Gemini 3.1 Pro<\/a><\/strong> \u2014 released February 19, 2026 \u2014 which scores 80.6% on SWE-bench Verified and 68.5% on Terminal-Bench 2.0. Gemini 3 Flash (approximately 78% SWE-bench Verified) is the lighter, cheaper option within the same CLI. These are distinct capabilities and the Gemini 3.1 Pro number is the correct headline for what Gemini CLI can deliver at full configuration.<\/p>\n<p>Gemini 3.1 Pro also scores strongly on several non-coding benchmarks: ARC-AGI-2 (77.1%), GPQA Diamond (94.3%), and BrowseComp (85.9%), making it a strong option for scientific computing, agentic research workflows, and tasks that mix coding with deep reasoning. For Google Cloud-native teams, Gemini CLI integrates directly with GCP, Vertex AI, and Android Studio.<\/p>\n<p>The free tier is its most strategically distinctive feature. Solo developers, students, and open-source maintainers who cannot justify a $20\u2013$200\/month coding agent subscription have a legitimate frontier-quality option here. At 80.6% SWE-bench Verified \u2014 matching Claude Opus 4.6 and ahead of GitHub Copilot\u2019s default configuration \u2014 this is not a compromise free tier. 
It is a genuinely competitive product that removes cost as a barrier to entry.<\/p>\n<p><strong>Best for:<\/strong> Cost-sensitive developers, Google Cloud teams, and individual contributors who want frontier model quality without a monthly subscription.<\/p>\n<h3 class=\"wp-block-heading\"><strong>#5. <a href=\"https:\/\/github.com\/features\/copilot\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub Copilot (Microsoft\/GitHub)<\/a><\/strong><\/h3>\n<p><strong>SWE-bench Verified (Agent Mode, default model):<\/strong> ~56% <strong>Adoption:<\/strong> 4.7 million paid subscribers (January 2026) <strong>Pricing:<\/strong> $10\/month (Pro), $19\/month (Business), $39\/month (Pro+), Enterprise custom pricing; AI Credits billing transition on June 1, 2026<\/p>\n<p>GitHub Copilot is not the most capable agent on this list by benchmark, but it is the most widely deployed. With 4.7 million paid subscribers \u2014 75% year-over-year growth \u2014 and 76% developer awareness per GitHub\u2019s Octoverse report, Copilot is the baseline AI coding tool at most enterprise software organizations. Microsoft CEO Satya Nadella confirmed in early 2026 that Copilot now represents a larger business than GitHub itself.<\/p>\n<p>Two important updates for the current pricing picture: GitHub added a <strong>Copilot Pro+ tier at $39\/month<\/strong> that unlocks the full model roster and higher compute limits. More significantly, GitHub announced that <strong><a href=\"https:\/\/github.blog\/news-insights\/company-news\/github-copilot-is-moving-to-usage-based-billing\/\">Copilot is moving to AI Credits-based billing on June 1, 2026<\/a><\/strong>, which means certain agent actions, premium model calls, and background task execution will draw from a credits pool rather than being included in the flat monthly fee. 
Base plan prices are unchanged as of the announcement, but total cost for heavy agentic use may increase depending on how credits are consumed.<\/p>\n<p>On model selection: in February 2026, <a href=\"https:\/\/github.blog\/changelog\/2026-02-26-claude-and-codex-now-available-for-copilot-business-pro-users\/\">GitHub made Copilot a multi-model platform<\/a> by adding Claude and OpenAI Codex as available backends for Copilot Business and Pro customers. The 56% SWE-bench figure reflects the default proprietary Copilot model. Configuring it to use Claude Opus 4.7 or GPT-5.5 would push that number substantially higher \u2014 though premium model calls draw from the credits pool under the new billing model.<\/p>\n<p>At $10\/month for individuals and $19\/month for business seats, Copilot\u2019s price-to-capability ratio is the strongest entry point for enterprise teams that need predictable licensing, SOC 2 compliance, audit logs, and broad IDE support across VS Code, JetBrains, Visual Studio, Neovim, and Xcode. In enterprise procurement, compliance posture often outweighs a few SWE-bench percentage points.<\/p>\n<p><strong>Best for:<\/strong> Enterprise teams that need predictable licensing, compliance posture, and broad IDE support across multiple environments.<\/p>\n<h3 class=\"wp-block-heading\"><strong>#6. <a href=\"https:\/\/cognition.ai\/blog\/devin-2\" target=\"_blank\" rel=\"noreferrer noopener\">Devin 2.0 (Cognition AI)<\/a><\/strong><\/h3>\n<p><strong>Performance:<\/strong> Higher on clearly scoped tasks; significantly weaker on ambiguous or complex tasks <strong>Pricing (updated April 14, 2026):<\/strong> Free, Pro $20\/month, Max $200\/month, Teams usage-based with $80\/month minimum, Enterprise custom<\/p>\n<p>Devin holds a special place in this category\u2019s history. Its 13.86% SWE-bench Lite score at launch in early 2024 \u2014 the first time any AI system had autonomously resolved real GitHub issues at meaningful scale \u2014 was industry-defining. 
By today\u2019s standards, every tool above it in this ranking has surpassed that number by a factor of four or more.<\/p>\n<p>Devin 2.0 is a substantially different product. It runs in a fully sandboxed cloud environment with its own IDE, browser, terminal, and shell. You assign a task; Devin produces a step-by-step plan you can review and edit; then it writes code, runs tests, and submits a pull request. Interactive Planning and Devin Wiki \u2014 which auto-indexes repositories and generates architecture documentation \u2014 address two of the original\u2019s biggest criticisms.<\/p>\n<p>On well-scoped, well-defined tasks \u2014 framework upgrades, library migrations, tech debt cleanup, test coverage additions \u2014 Devin reports higher success rates, with independent developer testing consistently showing strong results on clearly specified work. Reliability drops sharply for ambiguous or architecturally complex tasks; one documented community test found far more failures than successes across 20 varied tasks, highlighting that task specification quality directly determines output quality.<\/p>\n<p><strong>On pricing<\/strong>: <a href=\"https:\/\/cognition.ai\/blog\/new-self-serve-plans-for-devin\">Cognition retired its older Core and ACU-based self-serve plans on April 14, 2026<\/a> and introduced cleaner tiers: Free, Pro at $20\/month, Max at $200\/month, Teams usage-based with an $80\/month minimum, and Enterprise with custom pricing. 
If you have seen the earlier \u201c$20 Core + $2.25\/ACU\u201d pricing in other articles, it is no longer current.<\/p>\n<p>Cognition also partnered with Cognizant in January 2026 to integrate Devin into enterprise engineering transformation offerings, and launched Cognition for Government in February 2026 with FedRAMP High authorization in progress \u2014 signaling a deliberate push into institutional deployments.<\/p>\n<p><strong>Best for:<\/strong> Teams with clearly scoped, well-specified engineering tasks \u2014 migrations, test generation, framework upgrades \u2014 where the cost of reviewing AI output is lower than the cost of doing the work manually.<\/p>\n<h3 class=\"wp-block-heading\"><strong>#7. <a href=\"https:\/\/github.com\/OpenHands\/OpenHands\">OpenHands<\/a> \/ OpenDevin (All-Hands AI)<\/strong><\/h3>\n<p><strong>SWE-bench Verified:<\/strong> 72% <strong>GAIA Benchmark:<\/strong> 67.9% <strong>License:<\/strong> MIT <strong>Pricing:<\/strong> Free to self-host; pay only for model API inference<\/p>\n<p>OpenHands (formerly OpenDevin, rebranded in late 2024 under the All-Hands AI organization) is the open-source community\u2019s answer to Devin. With strong open-source adoption visible through GitHub activity and community usage, and a 72% SWE-bench Verified score, it matches or exceeds commercial agents at several price points.<\/p>\n<p>OpenHands supports 100+ LLM backends \u2014 any OpenAI-compatible API, including Claude, GPT-5, Mistral, Llama, and local models via Ollama. The CodeAct agent can execute code, run terminal commands, browse the web, and interact with web-based development tools inside a Docker sandbox. Its 67.9% on the GAIA benchmark confirms that web interaction capabilities are substantive.<\/p>\n<p>The bring-your-own-key model means zero platform markup \u2014 you pay inference costs directly to your model provider. 
For open-source projects, budget-constrained teams, and developers who want full auditability of agent behavior, it is the strongest option in this tier. Self-hosting requires Docker and access to an LLM provider API; there is no hosted SaaS product.<\/p>\n<p><strong>Best for:<\/strong> Open-source teams, developers who want full control and auditability, and budget-conscious practitioners who already have API credits with a major model provider.<\/p>\n<h3 class=\"wp-block-heading\"><strong>#8. <a href=\"https:\/\/www.augmentcode.com\/product\/ide-agents\" target=\"_blank\" rel=\"noreferrer noopener\">Augment Code<\/a><\/strong><\/h3>\n<p><strong>SWE-bench Verified (self-reported, Augment harness):<\/strong> 70.6% <strong>Differentiator:<\/strong> Full repository context engine; MCP-interoperable <strong>Pricing:<\/strong> Team and Enterprise tiers<\/p>\n<p>Augment Code\u2019s 70.6% SWE-bench score is self-reported using Augment\u2019s own harness and <a href=\"https:\/\/www.augmentcode.com\/blog\/auggie-tops-swe-bench-pro\">published on Augment\u2019s engineering blog<\/a>. As with all agent-scaffolding-dependent scores, it should be read as \u201cwhat Augment + Opus 4.5 achieves with Augment\u2019s context engine,\u201d not a standalone model number. That caveat stated, the architectural insight behind the score is real and independently validated: in the February 2026 scaffold comparison described earlier, Augment\u2019s context-first approach outperformed other frameworks running the same model by 17 problems out of 731.<\/p>\n<p>The core innovation is that Augment\u2019s engine indexes an entire repository before the agent begins work \u2014 rather than building context reactively from open files. For enterprise teams working in large, mature monorepos, this produces measurably better results on tasks that require cross-module reasoning. Augment also exposes its context engine via MCP (Model Context Protocol), making it interoperable with other agents. 
A developer could use Augment\u2019s indexing while running Claude Code or Codex for generation.<\/p>\n<p><strong>Best for:<\/strong> Enterprise teams with large, mature codebases who need deeper repository context than single-session tools provide.<\/p>\n<h3 class=\"wp-block-heading\"><strong>#9. <a href=\"https:\/\/aider.chat\/\">Aider<\/a><\/strong><\/h3>\n<p><strong>Pricing:<\/strong> Free (open-source); pay for model API inference <strong>Architecture:<\/strong> Git-native terminal agent<\/p>\n<p>Aider is the git-native coding agent: it operates directly in your local repository and structures its changes as a series of atomic git commits with descriptive messages \u2014 a workflow that meshes well with teams that do careful code review. It supports any OpenAI-compatible model, giving the same model-agnostic flexibility as OpenHands, and runs entirely in the terminal with no IDE dependency.<\/p>\n<p>Where Aider lags behind higher-ranked tools is on complex, multi-step agentic tasks that require web access, browser interaction, or long-horizon planning. It is a powerful tool within a clearly defined scope \u2014 terminal-based, git-integrated coding \u2014 rather than a general-purpose autonomous agent.<\/p>\n<p><strong>Best for:<\/strong> Developers who prioritize git-native workflows, clean commit histories, and full control over their editor environment.<\/p>\n<h3 class=\"wp-block-heading\"><strong>#10. <a href=\"https:\/\/cline.bot\/\">Cline<\/a> (Open-Source)<\/strong><\/h3>\n<p>Cline is VS Code\u2019s most popular open-source AI coding extension, with 5 million installs claimed across supported marketplaces. It ships with Plan\/Act modes, can run terminal commands, edit files across a repository, automate browser testing, and extend through any MCP server. The bring-your-own-key architecture means zero inference markup. 
Roo Code, a community fork, offers additional customization for teams that want to go beyond the core project.<\/p>\n<p><strong>Best for:<\/strong> VS Code developers who want open-source flexibility, full code auditability, and the ability to bring their own models without platform markup.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Marktechpost\u2019s Visual Explainer<\/strong><\/h2>\n<div>\n<div class=\"mtp-progress-bar\">\n<div class=\"mtp-progress-fill\"><\/div>\n<\/div>\n<div class=\"mtp-wrap\">\n<div class=\"mtp-track\">\n<p><!-- SLIDE 1: COVER --><\/p>\n<div class=\"mtp-slide\">\n<div class=\"mtp-topbar\"><span class=\"mtp-logo\">Marktechpost<\/span><span class=\"mtp-counter\">01 \/ 14<\/span><\/div>\n<div class=\"mtp-tag\">Research Report \u00b7 May 2026<\/div>\n<h2 class=\"mtp-title\">Best AI Agents for Software Development \u2014 Ranked<\/h2>\n<p class=\"mtp-sub\">A benchmark-driven look at the current field<\/p>\n<p class=\"mtp-body\">10 agents ranked by SWE-bench Verified, SWE-bench Pro, Terminal-Bench 2.0, and real developer usage. 
Includes the contamination warning every ranking is missing.<\/p>\n<div class=\"mtp-metrics\">\n<div class=\"mtp-metric\">\n<div class=\"mtp-metric-label\">Agents Ranked<\/div>\n<div class=\"mtp-metric-val\">10<\/div>\n<\/div>\n<div class=\"mtp-metric\">\n<div class=\"mtp-metric-label\">Top SWE-bench Score<\/div>\n<div class=\"mtp-metric-val\">93.9%<\/div>\n<div class=\"mtp-metric-note\">Claude Mythos Preview (restricted)<\/div>\n<\/div>\n<div class=\"mtp-metric\">\n<div class=\"mtp-metric-label\">Best Available<\/div>\n<div class=\"mtp-metric-val\">87.6%<\/div>\n<div class=\"mtp-metric-note\">Claude Code \/ Opus 4.7<\/div>\n<\/div><\/div>\n<div class=\"mtp-bestfor\">\n<div class=\"mtp-bestfor-label\">What\u2019s inside<\/div>\n<div class=\"mtp-bestfor-text\">Rankings \u00b7 Benchmark methodology \u00b7 SWE-bench contamination \u00b7 Security &amp; governance \u00b7 Layered stack guide<\/div>\n<\/div>\n<\/div>\n<p><!-- SLIDE 2: BENCHMARK WARNING --><\/p>\n<div class=\"mtp-slide\">\n<div class=\"mtp-topbar\"><span class=\"mtp-logo\">Marktechpost<\/span><span class=\"mtp-counter\">02 \/ 14<\/span><\/div>\n<div class=\"mtp-tag\"><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/26a0.png\" alt=\"\u26a0\" class=\"wp-smiley\" \/> Benchmark Alert<\/div>\n<h2 class=\"mtp-title\">The benchmark everyone cites is now disputed<\/h2>\n<div class=\"mtp-warn-box\">\n<div class=\"mtp-warn-head\">SWE-bench Verified \u2014 contaminated as of Feb 2026<\/div>\n<p>On February 23, 2026, OpenAI\u2019s Frontier Evals team stopped reporting SWE-bench Verified scores. Their audit found <strong>59.4% of the hardest test cases had fundamental flaws<\/strong>, and that every major frontier model \u2014 GPT-5.2, Claude Opus 4.5, Gemini 3 Flash \u2014 could reproduce gold-patch solutions verbatim from memory using only a task ID. 
The benchmark was measuring training data exposure, not coding ability.<\/p>\n<\/div>\n<p class=\"mtp-body\">OpenAI now recommends <strong>SWE-bench Pro<\/strong> for frontier coding evaluation. Other labs still publish Verified scores \u2014 they remain useful for broad direction, but should not be treated as clean, objective measurements. All scores in this guide are labeled accordingly.<\/p>\n<div class=\"mtp-bestfor\">\n<div class=\"mtp-bestfor-label\">Key rule<\/div>\n<div class=\"mtp-bestfor-text\">Treat SWE-bench Verified as directional. Prefer SWE-bench Pro or your own held-out evaluation on real code.<\/div>\n<\/div>\n<\/div>\n<p><!-- SLIDE 3: HOW TO READ BENCHMARKS --><\/p>\n<div class=\"mtp-slide\">\n<div class=\"mtp-topbar\"><span class=\"mtp-logo\">Marktechpost<\/span><span class=\"mtp-counter\">03 \/ 14<\/span><\/div>\n<div class=\"mtp-tag\">Benchmark Guide<\/div>\n<h2 class=\"mtp-title\">Three benchmarks \u2014 what each actually measures<\/h2>\n<div class=\"mtp-grid-2\">\n<div class=\"mtp-card\">\n<h4>SWE-bench Verified<\/h4>\n<p>      <span class=\"mtp-num\">~88%<\/span><\/p>\n<p>500 real GitHub issues (Python only). Now contaminated. Self-reported. Use as direction only.<\/p>\n<\/div>\n<div class=\"mtp-card\">\n<h4>SWE-bench Pro<\/h4>\n<p>      <span class=\"mtp-num\">23\u201364%<\/span><\/p>\n<p>1,865 tasks across 4 languages. Scores vary <strong>wildly by harness<\/strong> \u2014 sub-25% under SWE-Agent, 64% under optimized scaffolds. Same benchmark, different conditions.<\/p>\n<\/div>\n<div class=\"mtp-card\">\n<h4>Terminal-Bench 2.0<\/h4>\n<p>      <span class=\"mtp-num\">~82%<\/span><\/p>\n<p>Terminal workflows: shell, DevOps, pipelines. GPT-5.5 leads at 82.7%. 
Harness matters: same model can score 57.5% vs 64.7% depending on setup.<\/p>\n<\/div>\n<div class=\"mtp-card\">\n<h4>Scaffolding effect<\/h4>\n<p>      <span class=\"mtp-num\">\u00b117<\/span><\/p>\n<p>Same Opus 4.5 model, three frameworks, 731 problems \u2014 <strong>17 problems apart<\/strong>. Scaffolding \u2248 model quality.<\/p>\n<\/div>\n<\/div>\n<div class=\"mtp-bestfor\">\n<div class=\"mtp-bestfor-label\">Bottom line<\/div>\n<div class=\"mtp-bestfor-text\">No benchmark is a clean proxy. Run 50\u2013100 tasks on your own codebase before committing to any tool.<\/div>\n<\/div>\n<\/div>\n<p><!-- SLIDE 4: #1 CLAUDE CODE --><\/p>\n<div class=\"mtp-slide\">\n<div class=\"mtp-topbar\"><span class=\"mtp-logo\">Marktechpost<\/span><span class=\"mtp-counter\">04 \/ 14<\/span><\/div>\n<div class=\"mtp-rank-badge\">1<\/div>\n<h3 class=\"mtp-tool\">Claude Code \u2014 Anthropic<\/h3>\n<p class=\"mtp-sub\">Opus 4.7 \u00b7 Released April 16, 2026<\/p>\n<div class=\"mtp-bar-wrap\">\n<div class=\"mtp-bar-row\"><span class=\"mtp-bar-label\">SWE-bench Verified<\/span>\n<div class=\"mtp-bar-track\">\n<div class=\"mtp-bar-fill\"><\/div>\n<\/div>\n<p><span class=\"mtp-bar-pct\">87.6%<\/span><\/p><\/div>\n<div class=\"mtp-bar-row\"><span class=\"mtp-bar-label\">SWE-bench Pro<\/span>\n<div class=\"mtp-bar-track\">\n<div class=\"mtp-bar-fill\"><\/div>\n<\/div>\n<p><span class=\"mtp-bar-pct\">64.3%<\/span><\/p><\/div>\n<div class=\"mtp-bar-row\"><span class=\"mtp-bar-label\">Terminal-Bench 2.0<\/span>\n<div class=\"mtp-bar-track\">\n<div class=\"mtp-bar-fill\"><\/div>\n<\/div>\n<p><span class=\"mtp-bar-pct\">69.4%<\/span><\/p><\/div>\n<div class=\"mtp-bar-row\"><span class=\"mtp-bar-label\">CursorBench<\/span>\n<div class=\"mtp-bar-track\">\n<div class=\"mtp-bar-fill\"><\/div>\n<\/div>\n<p><span class=\"mtp-bar-pct\">70%<\/span><\/p><\/div>\n<\/div>\n<p class=\"mtp-body\">Self-verification (writes tests, runs them, fixes failures before surfacing results). 
Multi-agent coordination for parallel workstreams. 1M token context for large repos. <strong>Pricing:<\/strong> $20\u2013$200\/month subscription \u00b7 API $5\/$25 per 1M tokens.<\/p>\n<div class=\"mtp-bestfor\">\n<div class=\"mtp-bestfor-label\">Best for<\/div>\n<div class=\"mtp-bestfor-text\">Complex multi-file engineering, large codebases, long-horizon refactoring \u2014 highest code quality of any publicly available agent.<\/div>\n<\/div>\n<\/div>\n<p><!-- SLIDE 5: #2 OPENAI CODEX --><\/p>\n<div class=\"mtp-slide\">\n<div class=\"mtp-topbar\"><span class=\"mtp-logo\">Marktechpost<\/span><span class=\"mtp-counter\">05 \/ 14<\/span><\/div>\n<div class=\"mtp-rank-badge\">2<\/div>\n<h3 class=\"mtp-tool\">OpenAI Codex \u2014 GPT-5.5<\/h3>\n<p class=\"mtp-sub\">Released April 23, 2026 \u00b7 CLI runs locally on your machine<\/p>\n<div class=\"mtp-bar-wrap\">\n<div class=\"mtp-bar-row\"><span class=\"mtp-bar-label\">Terminal-Bench 2.0<\/span>\n<div class=\"mtp-bar-track\">\n<div class=\"mtp-bar-fill\"><\/div>\n<\/div>\n<p><span class=\"mtp-bar-pct\">82.7% #1<\/span><\/p><\/div>\n<div class=\"mtp-bar-row\"><span class=\"mtp-bar-label\">SWE-bench Pro (Public)<\/span>\n<div class=\"mtp-bar-track\">\n<div class=\"mtp-bar-fill\"><\/div>\n<\/div>\n<p><span class=\"mtp-bar-pct\">58.6%<\/span><\/p><\/div>\n<div class=\"mtp-bar-row\"><span class=\"mtp-bar-label\">SWE-bench Verified*<\/span>\n<div class=\"mtp-bar-track\">\n<div class=\"mtp-bar-fill mtp-dim-bar\"><\/div>\n<\/div>\n<p><span class=\"mtp-bar-pct mtp-dim\">~88.7%<\/span><\/p><\/div>\n<\/div>\n<p class=\"mtp-body\"><strong>Important:<\/strong> The Codex CLI is a local terminal tool \u2014 cloud execution is the Codex Web\/IDE product. *OpenAI does not self-report Verified scores; ~88.7% is from third-party trackers. 
<strong>Pricing:<\/strong> CLI open-source (ChatGPT plan or API key required) \u00b7 Plus $20\/mo \u00b7 API $5\/$30 per 1M tokens.<\/p>\n<div class=\"mtp-bestfor\">\n<div class=\"mtp-bestfor-label\">Best for<\/div>\n<div class=\"mtp-bestfor-text\">Terminal-native DevOps workflows, pipeline automation, fire-and-forget cloud execution via Codex Web \u2014 and the strongest Terminal-Bench score available.<\/div>\n<\/div>\n<\/div>\n<p><!-- SLIDE 6: #3 CURSOR --><\/p>\n<div class=\"mtp-slide\">\n<div class=\"mtp-topbar\"><span class=\"mtp-logo\">Marktechpost<\/span><span class=\"mtp-counter\">06 \/ 14<\/span><\/div>\n<div class=\"mtp-rank-badge\">3<\/div>\n<h3 class=\"mtp-tool\">Cursor<\/h3>\n<p class=\"mtp-sub\">AI-native VS Code fork \u00b7 $2B ARR (Feb 2026)<\/p>\n<div class=\"mtp-metrics\">\n<div class=\"mtp-metric\">\n<div class=\"mtp-metric-label\">Default SWE-bench<\/div>\n<div class=\"mtp-metric-val\">~51.7%<\/div>\n<div class=\"mtp-metric-note\">model-dependent<\/div>\n<\/div>\n<div class=\"mtp-metric\">\n<div class=\"mtp-metric-label\">Speed vs Copilot<\/div>\n<div class=\"mtp-metric-val\">+30%<\/div>\n<div class=\"mtp-metric-note\">task completion<\/div>\n<\/div>\n<div class=\"mtp-metric\">\n<div class=\"mtp-metric-label\">With Opus 4.7<\/div>\n<div class=\"mtp-metric-val\">\u2191\u2191<\/div>\n<div class=\"mtp-metric-note\">ceiling rises to 87.6%<\/div>\n<\/div><\/div>\n<p class=\"mtp-body\">Model-agnostic: supports Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro, Grok. Plan\/Act mode for structured workflows. Background Agents (Pro+ $60\/mo) run autonomous cloud sessions in parallel. <strong>Important limitation:<\/strong> VS Code only \u2014 no JetBrains, Neovim, or Xcode support.<\/p>\n<div class=\"mtp-bestfor\">\n<div class=\"mtp-bestfor-label\">Best for<\/div>\n<div class=\"mtp-bestfor-text\">VS Code-native developers who want the best AI-integrated daily editing experience. 
$20\/month Pro is the most productive IDE-native entry point.<\/div>\n<\/div>\n<\/div>\n<p><!-- SLIDE 7: #4 GEMINI CLI --><\/p>\n<div class=\"mtp-slide\">\n<div class=\"mtp-topbar\"><span class=\"mtp-logo\">Marktechpost<\/span><span class=\"mtp-counter\">07 \/ 14<\/span><\/div>\n<div class=\"mtp-rank-badge\">4<\/div>\n<h3 class=\"mtp-tool\">Gemini CLI \u2014 Google DeepMind<\/h3>\n<p class=\"mtp-sub\">Gemini 3.1 Pro \u00b7 Free tier available<\/p>\n<div class=\"mtp-bar-wrap\">\n<div class=\"mtp-bar-row\"><span class=\"mtp-bar-label\">SWE-bench Verified<\/span>\n<div class=\"mtp-bar-track\">\n<div class=\"mtp-bar-fill\"><\/div>\n<\/div>\n<p><span class=\"mtp-bar-pct\">80.6%<\/span><\/p><\/div>\n<div class=\"mtp-bar-row\"><span class=\"mtp-bar-label\">Terminal-Bench 2.0<\/span>\n<div class=\"mtp-bar-track\">\n<div class=\"mtp-bar-fill\"><\/div>\n<\/div>\n<p><span class=\"mtp-bar-pct\">68.5%<\/span><\/p><\/div>\n<div class=\"mtp-bar-row\"><span class=\"mtp-bar-label\">GPQA Diamond<\/span>\n<div class=\"mtp-bar-track\">\n<div class=\"mtp-bar-fill\"><\/div>\n<\/div>\n<p><span class=\"mtp-bar-pct\">94.3%<\/span><\/p><\/div>\n<div class=\"mtp-bar-row\"><span class=\"mtp-bar-label\">ARC-AGI-2<\/span>\n<div class=\"mtp-bar-track\">\n<div class=\"mtp-bar-fill\"><\/div>\n<\/div>\n<p><span class=\"mtp-bar-pct\">77.1%<\/span><\/p><\/div>\n<\/div>\n<p class=\"mtp-body\">Primary model: Gemini 3.1 Pro (80.6%). Gemini 3 Flash (~78%) is the lighter\/cheaper option. 1M token context. Install: <code>npm install -g @google\/gemini-cli<\/code>. 
Free tier removes all cost barriers.<\/p>\n<div class=\"mtp-bestfor\">\n<div class=\"mtp-bestfor-label\">Best for<\/div>\n<div class=\"mtp-bestfor-text\">Cost-sensitive developers, Google Cloud teams, and anyone wanting frontier-quality coding without a monthly subscription.<\/div>\n<\/div>\n<\/div>\n<p><!-- SLIDE 8: #5 GITHUB COPILOT --><\/p>\n<div class=\"mtp-slide\">\n<div class=\"mtp-topbar\"><span class=\"mtp-logo\">Marktechpost<\/span><span class=\"mtp-counter\">08 \/ 14<\/span><\/div>\n<div class=\"mtp-rank-badge\">5<\/div>\n<h3 class=\"mtp-tool\">GitHub Copilot<\/h3>\n<p class=\"mtp-sub\">4.7M paid subscribers \u00b7 Multi-model platform since Feb 2026<\/p>\n<div class=\"mtp-metrics\">\n<div class=\"mtp-metric\">\n<div class=\"mtp-metric-label\">Default SWE-bench<\/div>\n<div class=\"mtp-metric-val\">~56%<\/div>\n<div class=\"mtp-metric-note\">Agent Mode<\/div>\n<\/div>\n<div class=\"mtp-metric\">\n<div class=\"mtp-metric-label\">Pro tier<\/div>\n<div class=\"mtp-metric-val\">$10\/mo<\/div>\n<\/div>\n<div class=\"mtp-metric\">\n<div class=\"mtp-metric-label\">AI Credits<\/div>\n<div class=\"mtp-metric-val\">Jun 1<\/div>\n<div class=\"mtp-metric-note\">billing transition 2026<\/div>\n<\/div><\/div>\n<p class=\"mtp-body\">Now supports Claude Opus 4.7 and GPT-5.5 as backends (premium model calls draw from AI Credits). Works across VS Code, JetBrains, Visual Studio, Neovim, Xcode. 
<strong>Pricing:<\/strong> $10 Pro \u00b7 $19 Business \u00b7 $39 Pro+ \u00b7 Enterprise custom.<\/p>\n<div class=\"mtp-pill mtp-green\">SOC 2 compliant<\/div>\n<div class=\"mtp-pill mtp-green\">Audit logs<\/div>\n<div class=\"mtp-pill mtp-green\">6 IDEs<\/div>\n<div class=\"mtp-bestfor\">\n<div class=\"mtp-bestfor-label\">Best for<\/div>\n<div class=\"mtp-bestfor-text\">Enterprise teams needing predictable licensing, compliance posture, and broad IDE support across every environment.<\/div>\n<\/div>\n<\/div>\n<p><!-- SLIDE 9: #6 DEVIN + #7 OPENHANDS --><\/p>\n<div class=\"mtp-slide\">\n<div class=\"mtp-topbar\"><span class=\"mtp-logo\">Marktechpost<\/span><span class=\"mtp-counter\">09 \/ 14<\/span><\/div>\n<div class=\"mtp-tag\">Autonomous Agents<\/div>\n<h2 class=\"mtp-title\">#6 Devin 2.0 &amp; #7 OpenHands<\/h2>\n<div class=\"mtp-grid-2\">\n<div class=\"mtp-card\">\n<h4>#6 Devin 2.0 \u2014 Cognition AI<\/h4>\n<p>      <span class=\"mtp-num\">Sandboxed<\/span><\/p>\n<p>Full cloud VM with IDE, browser, terminal. Plans + executes + submits PRs autonomously. <strong>Higher success on clearly scoped tasks; significantly weaker on ambiguous work.<\/strong><\/p>\n<p>Updated Apr 14: Free \u00b7 Pro $20 \u00b7 Max $200 \u00b7 Teams $80\/mo min \u00b7 Enterprise<\/p>\n<\/div>\n<div class=\"mtp-card\">\n<h4>#7 OpenHands \u2014 All-Hands AI<\/h4>\n<p>      <span class=\"mtp-num\">72%<\/span><\/p>\n<p>SWE-bench Verified. MIT licensed, free to self-host. 100+ LLM backends. CodeAct agent with Docker sandboxing and web browsing. 
GAIA: 67.9%.<\/p>\n<p>Pay only for API inference \u00b7 No hosted SaaS<\/p>\n<\/div>\n<\/div>\n<div class=\"mtp-bestfor\">\n<div class=\"mtp-bestfor-label\">Choose Devin if<\/div>\n<div class=\"mtp-bestfor-text\">You have clearly scoped, well-specified tasks (migrations, test coverage, framework upgrades) and capacity to review AI output before merging.<\/div>\n<\/div>\n<\/div>\n<p><!-- SLIDE 10: #8-10 OPEN SOURCE --><\/p>\n<div class=\"mtp-slide\">\n<div class=\"mtp-topbar\"><span class=\"mtp-logo\">Marktechpost<\/span><span class=\"mtp-counter\">10 \/ 14<\/span><\/div>\n<div class=\"mtp-tag\">Open-Source Tier<\/div>\n<h2 class=\"mtp-title\">#8 Augment Code \u00b7 #9 Aider \u00b7 #10 Cline<\/h2>\n<div class=\"mtp-bar-wrap\">\n<div class=\"mtp-bar-row\"><span class=\"mtp-bar-label\">#8 Augment Code<\/span>\n<div class=\"mtp-bar-track\">\n<div class=\"mtp-bar-fill\"><\/div>\n<\/div>\n<p><span class=\"mtp-bar-pct\">70.6%*<\/span><\/p><\/div>\n<div class=\"mtp-bar-row\"><span class=\"mtp-bar-label\">#9 Aider<\/span>\n<div class=\"mtp-bar-track\">\n<div class=\"mtp-bar-fill mtp-dim-bar\"><\/div>\n<\/div>\n<p><span class=\"mtp-bar-pct\">model-dep<\/span><\/p><\/div>\n<div class=\"mtp-bar-row\"><span class=\"mtp-bar-label\">#10 Cline<\/span>\n<div class=\"mtp-bar-track\">\n<div class=\"mtp-bar-fill mtp-dim-bar\"><\/div>\n<\/div>\n<p><span class=\"mtp-bar-pct\">model-dep<\/span><\/p><\/div>\n<\/div>\n<p>*Augment score is self-reported via Augment\u2019s own harness<\/p>\n<p class=\"mtp-body\"><strong>Augment Code<\/strong> \u2014 full repo context indexing before the agent starts; MCP-interoperable. Best for large enterprise monorepos.<br \/><strong>Aider<\/strong> \u2014 git-native terminal agent producing atomic commits. Best for clean commit-level workflows.<br \/><strong>Cline<\/strong> \u2014 5M installs, VS Code extension, bring-your-own-key, zero inference markup. 
Roo Code is the community fork.<\/p>\n<div class=\"mtp-bestfor\">\n<div class=\"mtp-bestfor-label\">All three<\/div>\n<div class=\"mtp-bestfor-text\">Pay only for API inference (no platform markup). Full code auditability. Effective ceiling scales with your chosen model.<\/div>\n<\/div>\n<\/div>\n<p><!-- SLIDE 11: SCAFFOLDING PROBLEM --><\/p>\n<div class=\"mtp-slide\">\n<div class=\"mtp-topbar\"><span class=\"mtp-logo\">Marktechpost<\/span><span class=\"mtp-counter\">11 \/ 14<\/span><\/div>\n<div class=\"mtp-tag\">Key Insight<\/div>\n<h2 class=\"mtp-title\">The scaffolding problem \u2014 same model, 17 problems apart<\/h2>\n<div class=\"mtp-metrics\">\n<div class=\"mtp-metric\">\n<div class=\"mtp-metric-label\">Problems tested<\/div>\n<div class=\"mtp-metric-val\">731<\/div>\n<\/div>\n<div class=\"mtp-metric\">\n<div class=\"mtp-metric-label\">Model used<\/div>\n<div class=\"mtp-metric-val\">Same<\/div>\n<div class=\"mtp-metric-note\">Claude Opus 4.5<\/div>\n<\/div>\n<div class=\"mtp-metric\">\n<div class=\"mtp-metric-label\">Score gap<\/div>\n<div class=\"mtp-metric-val\">17<\/div>\n<div class=\"mtp-metric-note\">problems apart (Feb 2026)<\/div>\n<\/div><\/div>\n<p class=\"mtp-body\">In February 2026, three different agent frameworks ran the same model (Claude Opus 4.5) against the same 731 SWE-bench problems. They scored 17 problems apart \u2014 a 2.3-point gap \u2014 purely from scaffolding differences. The winner (Augment Code) indexed the full repository before starting. The runner-up used a standard tool-call loop. The third used one-shot generation.<\/p>\n<p class=\"mtp-body\"><strong>Implication:<\/strong> A benchmark score labeled with a model name reflects the model AND the scaffold around it. 
Choosing an agent based solely on the model name \u2014 \u201cI\u2019ll use whichever tool runs Opus 4.7\u201d \u2014 ignores the variable that often matters most.<\/p>\n<div class=\"mtp-bestfor\">\n<div class=\"mtp-bestfor-label\">Rule of thumb<\/div>\n<div class=\"mtp-bestfor-text\">Context strategy + retrieval quality + verification loops \u2248 model version, when it comes to benchmark outcomes.<\/div>\n<\/div>\n<\/div>\n<p><!-- SLIDE 12: SECURITY &amp; GOVERNANCE --><\/p>\n<div class=\"mtp-slide\">\n<div class=\"mtp-topbar\"><span class=\"mtp-logo\">Marktechpost<\/span><span class=\"mtp-counter\">12 \/ 14<\/span><\/div>\n<div class=\"mtp-tag\">Production Teams<\/div>\n<h2 class=\"mtp-title\">Security &amp; governance \u2014 what benchmarks don\u2019t measure<\/h2>\n<div class=\"mtp-grid-2\">\n<div class=\"mtp-card\">\n<h4><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f512.png\" alt=\"\ud83d\udd12\" class=\"wp-smiley\" \/> Sandboxing<\/h4>\n<p>Devin and Codex Web run in isolated cloud VMs. Claude Code and Cline run with local system access by default. Know the difference.<\/p>\n<\/div>\n<div class=\"mtp-card\">\n<h4><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f511.png\" alt=\"\ud83d\udd11\" class=\"wp-smiley\" \/> Secret exposure<\/h4>\n<p>Agents that read <code>.env<\/code> files and config dirs are an active attack surface. Explicit access controls are non-optional.<\/p>\n<\/div>\n<div class=\"mtp-card\">\n<h4><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f489.png\" alt=\"\ud83d\udc89\" class=\"wp-smiley\" \/> Prompt injection<\/h4>\n<p>Malicious strings in code comments, issue descriptions, or docs can instruct agents to take unauthorized actions. 
This is a known vulnerability class.<\/p>\n<\/div>\n<div class=\"mtp-card\">\n<h4><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4cb.png\" alt=\"\ud83d\udccb\" class=\"wp-smiley\" \/> Audit logging<\/h4>\n<p>GitHub Copilot and Augment Code have explicit audit log features. Open-source tools generally do not \u2014 instrument yourself or choose a tool that does.<\/p>\n<\/div>\n<\/div>\n<div class=\"mtp-bestfor\">\n<div class=\"mtp-bestfor-label\">Before you ship AI-generated code<\/div>\n<div class=\"mtp-bestfor-text\">Define your human review gate explicitly. The organizations running agentic coding safely in 2026 treat that gate as a policy, not a developer preference.<\/div>\n<\/div>\n<\/div>\n<p><!-- SLIDE 13: HOW DEVS USE THESE --><\/p>\n<div class=\"mtp-slide\">\n<div class=\"mtp-topbar\"><span class=\"mtp-logo\">Marktechpost<\/span><span class=\"mtp-counter\">13 \/ 14<\/span><\/div>\n<div class=\"mtp-tag\">Developer Patterns<\/div>\n<h2 class=\"mtp-title\">How 70% of developers actually stack these tools<\/h2>\n<div class=\"mtp-grid-2\">\n<div class=\"mtp-card\">\n<h4>Layer 1 \u2014 Terminal agent<\/h4>\n<p><strong>Claude Code or Codex<\/strong> for complex work: multi-file refactors, architectural changes, difficult debugging. Use when a task would take a senior engineer hours.<\/p>\n<\/div>\n<div class=\"mtp-card\">\n<h4>Layer 2 \u2014 IDE extension<\/h4>\n<p><strong>Cursor or Copilot<\/strong> for daily editing: inline completions, quick edits, test generation. Eliminates context-switching overhead for routine work.<\/p>\n<\/div>\n<div class=\"mtp-card\">\n<h4>Layer 3 \u2014 Open-source tool<\/h4>\n<p><strong>Aider, Cline, or OpenHands<\/strong> for model flexibility, zero markup on inference, and full auditability. 
Fallback when commercial tools have outages or price changes.<\/p>\n<\/div>\n<div class=\"mtp-card\">\n<h4>Most common setup<\/h4>\n<p>Claude Code \/ Codex for hard tasks + Copilot or Cursor for daily flow + one open-source tool for flexibility. Layer 1 + Layer 2 costs ~$30\u201340\/mo.<\/p>\n<\/div>\n<\/div>\n<div class=\"mtp-bestfor\">\n<div class=\"mtp-bestfor-label\">The point<\/div>\n<div class=\"mtp-bestfor-text\">Using multiple tools isn\u2019t indecision \u2014 it reflects genuine specialization. No single agent dominates all three layers with equal quality today.<\/div>\n<\/div>\n<\/div>\n<p><!-- SLIDE 14: RANKINGS TABLE --><\/p>\n<div class=\"mtp-slide\">\n<div class=\"mtp-topbar\"><span class=\"mtp-logo\">Marktechpost<\/span><span class=\"mtp-counter\">14 \/ 14<\/span><\/div>\n<div class=\"mtp-tag\">Summary Rankings \u00b7 May 2026<\/div>\n<h2 class=\"mtp-title\">Full leaderboard<\/h2>\n<div>\n<table class=\"mtp-table\">\n<thead>\n<tr>\n<th>#<\/th>\n<th>Agent<\/th>\n<th>Key Metric<\/th>\n<th>Best For<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"mtp-score-dim\">\u2014<\/td>\n<td class=\"mtp-tool-name\">Claude Mythos Preview<\/td>\n<td class=\"mtp-score-dim\">93.9% SWE-b-V (restricted)<\/td>\n<td>Not publicly available<\/td>\n<\/tr>\n<tr>\n<td class=\"mtp-score\">1<\/td>\n<td class=\"mtp-tool-name\">Claude Code (Opus 4.7)<\/td>\n<td class=\"mtp-score\">87.6% SWE-b-V<\/td>\n<td>Code quality, multi-file tasks<\/td>\n<\/tr>\n<tr>\n<td class=\"mtp-score\">2<\/td>\n<td class=\"mtp-tool-name\">OpenAI Codex (GPT-5.5)<\/td>\n<td class=\"mtp-score\">82.7% Terminal-Bench<\/td>\n<td>Terminal \/ DevOps workflows<\/td>\n<\/tr>\n<tr>\n<td>3<\/td>\n<td class=\"mtp-tool-name\">Cursor<\/td>\n<td>~51.7% default (\u2191 w\/ Opus 4.7)<\/td>\n<td>IDE-native daily dev<\/td>\n<\/tr>\n<tr>\n<td>4<\/td>\n<td class=\"mtp-tool-name\">Gemini CLI<\/td>\n<td>80.6% SWE-b-V<\/td>\n<td>Free tier, Google Cloud<\/td>\n<\/tr>\n<tr>\n<td>5<\/td>\n<td class=\"mtp-tool-name\">GitHub 
Copilot<\/td>\n<td>~56% default Agent Mode<\/td>\n<td>Enterprise, multi-IDE<\/td>\n<\/tr>\n<tr>\n<td>6<\/td>\n<td class=\"mtp-tool-name\">Devin 2.0<\/td>\n<td>Sandboxed autonomous<\/td>\n<td>Well-scoped tasks<\/td>\n<\/tr>\n<tr>\n<td>7<\/td>\n<td class=\"mtp-tool-name\">OpenHands<\/td>\n<td>72% SWE-b-V<\/td>\n<td>Open-source, any model<\/td>\n<\/tr>\n<tr>\n<td>8<\/td>\n<td class=\"mtp-tool-name\">Augment Code<\/td>\n<td>70.6%* (self-reported)<\/td>\n<td>Large enterprise codebases<\/td>\n<\/tr>\n<tr>\n<td>9<\/td>\n<td class=\"mtp-tool-name\">Aider<\/td>\n<td>Model-dependent<\/td>\n<td>Git-native CLI<\/td>\n<\/tr>\n<tr>\n<td>10<\/td>\n<td class=\"mtp-tool-name\">Cline<\/td>\n<td>Model-dependent<\/td>\n<td>VS Code open-source<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<div class=\"mtp-source\">SWE-b-V = SWE-bench Verified (self-reported, see contamination note). Read the full article for primary source links.<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"mtp-nav\">\n  <button class=\"mtp-btn\" aria-label=\"Previous slide\">\u2190<\/button>\n<div class=\"mtp-dots\"><\/div>\n<p>  <button class=\"mtp-btn\" aria-label=\"Next slide\">\u2192<\/button>\n<\/p><\/div>\n<div class=\"mtp-source\">Marktechpost.com \u00b7 Sources: <a href=\"https:\/\/www.anthropic.com\/news\/claude-opus-4-7\" target=\"_blank\">Anthropic<\/a> \u00b7 <a href=\"https:\/\/openai.com\/index\/introducing-gpt-5-5\/\" target=\"_blank\">OpenAI<\/a> \u00b7 <a href=\"https:\/\/openai.com\/index\/why-we-no-longer-evaluate-swe-bench-verified\/\" target=\"_blank\">SWE-bench contamination post<\/a> \u00b7 <a href=\"https:\/\/labs.scale.com\/leaderboard\/swe_bench_pro_public\" target=\"_blank\">Scale AI leaderboard<\/a> \u00b7 <a href=\"https:\/\/cognition.ai\/blog\/new-self-serve-plans-for-devin\" target=\"_blank\">Cognition pricing<\/a><\/div>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>How Developers Are Actually Using These Tools in 2026<\/strong><\/h2>\n<p>The benchmark-maximizing strategy and the 
productivity-maximizing strategy are not the same thing. Based on community data and developer surveys, approximately 70% of productive professional developers in 2026 use two or more tools simultaneously.<\/p>\n<p><strong>The modal pattern is a layered stack:<\/strong><\/p>\n<p><strong>Terminal agents for complex tasks.<\/strong> Use Claude Code or Codex for multi-file refactoring, architectural changes, difficult debugging, or any task that requires holding substantial codebase context. These tools earn their higher cost on work that would take a senior engineer hours.<\/p>\n<p><strong>IDE extensions for daily editing.<\/strong> Use Cursor or GitHub Copilot for inline completions, quick edits, test generation, and ambient assistance that speeds up routine coding work. The cognitive overhead of switching between a terminal agent and a separate editor is real; IDE-native tools eliminate it for everyday tasks.<\/p>\n<p><strong>Open-source tools for model flexibility.<\/strong> Reach for Aider, Cline, or OpenHands when you want to test a new model, avoid platform markup, or need full auditability of agent behavior. These also serve as a fallback when commercial tools have outages or pricing changes.<\/p>\n<h2 class=\"wp-block-heading\"><strong>What the Next 12 Months Look Like<\/strong><\/h2>\n<p><strong>MCP as infrastructure.<\/strong> The Model Context Protocol is emerging as a shared standard that lets tools share context, hand off tasks, and compose capabilities. Augment&#8217;s context engine exposed via MCP, and Copilot accepting Claude and Codex as backends, suggest the field is moving toward interoperability rather than winner-take-all consolidation.<\/p>\n<p><strong>Autonomous PR pipelines.<\/strong> GitHub Copilot&#8217;s cloud agent, Codex&#8217;s background execution model, and Devin&#8217;s end-to-end PR workflow all point to the same future: AI agents that process issues from a backlog, work overnight, and surface review-ready pull requests in the morning. 
The bottleneck is no longer AI quality \u2014 it is the review bandwidth of human engineers and the governance frameworks organizations are building around autonomous code changes.<\/p>\n<p><strong>Enterprise governance as a differentiator.<\/strong> Gartner projects 40% of enterprise applications will include task-specific AI agents by the end of 2026, up from less than 5% in 2025. Compliance posture, audit logs, data handling guarantees, and security certifications will increasingly be the deciding factors in enterprise procurement \u2014 not SWE-bench position.<\/p>\n<p><strong>Open-source convergence.<\/strong> OpenHands at 72% SWE-bench Verified, and open-source models like MiniMax M2.5 (80.2% SWE-bench Verified) now matching proprietary frontier performance, show the quality gap between open and closed systems is closing. The remaining advantages for commercial tools are scaffolding sophistication, enterprise support, and product polish \u2014 not raw model capability.<\/p>\n<p><strong>The Mythos ceiling.<\/strong> Claude Mythos Preview at 93.9% SWE-bench Verified \u2014 roughly 6 points above the best publicly available model \u2014 signals that the performance frontier is well ahead of what developers can currently access. 
When models at that tier reach general availability, expect the category ranking to shift again.<\/p>\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p><em><strong>Primary sources:<\/strong> <a href=\"https:\/\/www.anthropic.com\/news\/claude-opus-4-7\">Anthropic Claude Opus 4.7 announcement<\/a> \u00b7 <a href=\"https:\/\/aws.amazon.com\/blogs\/aws\/introducing-anthropics-claude-opus-4-7-model-in-amazon-bedrock\/\">AWS blog: Claude Opus 4.7 on Amazon Bedrock<\/a> \u00b7 <a href=\"https:\/\/openai.com\/index\/introducing-gpt-5-5\/\">OpenAI: Introducing GPT-5.5<\/a> \u00b7 <a href=\"https:\/\/openai.com\/index\/why-we-no-longer-evaluate-swe-bench-verified\/\">OpenAI: Why we no longer evaluate SWE-bench Verified<\/a> \u00b7 <a href=\"https:\/\/openai.com\/index\/introducing-gpt-5-3-codex\/\">OpenAI: Introducing GPT-5.3-Codex<\/a> \u00b7 <a href=\"https:\/\/labs.scale.com\/leaderboard\/swe_bench_pro_public\">Scale AI SWE-bench Pro public leaderboard<\/a> \u00b7 <a href=\"https:\/\/arxiv.org\/html\/2509.16941v1\">SWE-bench Pro arXiv paper<\/a> \u00b7 <a href=\"https:\/\/www.swebench.com\/\">Official SWE-bench leaderboard<\/a> \u00b7 <a href=\"https:\/\/github.com\/openai\/codex\">GitHub: openai\/codex<\/a> \u00b7 <a href=\"https:\/\/cognition.ai\/blog\/new-self-serve-plans-for-devin\">Cognition: New self-serve plans for Devin<\/a> \u00b7 <a href=\"https:\/\/github.blog\/news-insights\/company-news\/github-copilot-is-moving-to-usage-based-billing\/\">GitHub Blog: Copilot moving to usage-based billing<\/a> \u00b7 <a href=\"https:\/\/github.blog\/changelog\/2026-02-26-claude-and-codex-now-available-for-copilot-business-pro-users\/\">GitHub Changelog: Claude and Codex for Copilot Business &amp; Pro<\/a> \u00b7 <a href=\"https:\/\/www.augmentcode.com\/blog\/auggie-tops-swe-bench-pro\">Augment Code: Auggie tops SWE-bench Pro<\/a> \u00b7 <a href=\"https:\/\/www.anthropic.com\/glasswing\">Anthropic Project Glasswing<\/a> \u00b7 <a 
href=\"https:\/\/deepmind.google\/models\/model-cards\/gemini-3-1-pro\/\">Google DeepMind Gemini 3.1 Pro model card<\/a> \u00b7 <a href=\"https:\/\/github.com\/OpenHands\/OpenHands\">OpenHands GitHub repository<\/a><\/em><\/p>\n\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/15\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/\">Best AI Agents for Software Development Ranked: A Benchmark-Driven Look at the Current Field<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>The AI coding agent market looks almost unrecognizable compared to 2024 or even early 2025. What started as inline autocomplete has evolved into fully autonomous systems that read GitHub issues, navigate multi-file codebases, write fixes, execute tests, and open pull requests \u2014 without a human typing a single line of code. By early 2026, roughly 85% of developers reported regularly using some form of AI assistance for coding. The category has fractured into distinct archetypes: terminal agents, AI-native IDEs, cloud-hosted autonomous engineers, and open-source frameworks that let you swap in whatever model you prefer. The problem is that every tool claims to be the best, and the benchmarks used to justify those claims are not always measuring the same things \u2014 and in some cases are no longer credible measures at all. This article features the most important AI coding agents by the metrics that actually matter for production software development, while being honest about where those metrics have broken down. If you are an AI\/ML engineer, software developer, or data scientist trying to decide where to invest your tooling budget in 2026, start here. 
How to Read These Benchmarks \u2014 Including Why the Most-Cited One Is Now Disputed Before the listing, an important calibration on the numbers \u2014 because one major benchmark shift happened mid-cycle and is not yet reflected in most tool comparison articles. SWE-bench Verified has been the industry\u2019s standard coding benchmark since mid-2024. It presents agents with 500 real GitHub issues drawn from popular Python repositories and measures whether the agent can understand the problem, navigate the codebase, generate a fix, and verify that it passes tests \u2014 end-to-end, without human guidance. It was a credible proxy. In February 2026, that changed. On February 23, 2026, OpenAI\u2019s Frontier Evals team published a detailed post explaining why it had stopped reporting SWE-bench Verified scores. Their auditors reviewed 138 of the hardest problems across 64 independent runs and found that 59.4% had fundamentally flawed or unsolvable test cases \u2014 tests that demanded exact function names not mentioned in the problem statement, or checked unrelated behavior pulled from upstream pull requests. More critically, they found evidence that every major frontier model \u2014 GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash \u2014 could reproduce the gold-patch solutions verbatim from memory using only the task ID, confirming systematic training data contamination. OpenAI\u2019s conclusion: \u201cImprovements on SWE-bench Verified no longer reflect meaningful improvements in models\u2019 real-world software development abilities.\u201d OpenAI now recommends SWE-bench Pro as the replacement for frontier coding evaluation. This does not make SWE-bench Verified scores useless. Other major labs continue to report them, third-party evaluators continue to run them, and they remain useful for broad directional comparison. 
But any ranking that presents SWE-bench Verified scores as clean, objective measurements of real-world ability \u2014 without this caveat \u2014 is giving you an incomplete picture. All scores in this article are flagged accordingly. SWE-bench Pro is harder to interpret than Verified because published results vary significantly by split, scaffold, harness, and reporting source. The benchmark contains 1,865 total tasks divided into a 731-task public set, an 858-task held-out set, and a 276-task commercial\/private set drawn from 18 proprietary startup codebases. When the original Scale AI paper measured frontier models using a unified SWE-Agent scaffold, top scores were below 25% \u2014 GPT-5 at 23.3% \u2014 reflecting a genuinely harder evaluation. However, current public leaderboard and vendor-reported runs now show substantially higher scores under newer models and optimized agent harnesses: OpenAI reports GPT-5.5 at 58.6% on SWE-bench Pro (Public), while Anthropic\u2019s comparison table lists Claude Opus 4.7 at 64.3% and Gemini 3.1 Pro at 54.2%. These numbers should not be directly compared with the original sub-25% SWE-Agent results without noting the scaffold and split differences \u2014 the benchmark has not changed, but the evaluation conditions and model generations have. When you see a 60%+ SWE-bench Pro score alongside a sub-25% one, they are measuring the same benchmark under very different conditions, not two separate tests. Terminal-Bench 2.0 evaluates terminal-native workflows: shell scripting, file system operations, environment setup, and DevOps automation. As of April 23, 2026, GPT-5.5 leads at 82.7% on this benchmark \u2014 confirmed in OpenAI\u2019s official release. Claude Opus 4.7 scores 69.4% (Anthropic\/AWS-reported), and Gemini 3.1 Pro scores 68.5%. An important methodological caveat: different harnesses produce different numbers for the same model. 
Anthropic\u2019s Opus 4.6 system card showed GPT-5.2-Codex scoring 57.5% on the independent Terminus-2 harness versus 64.7% on OpenAI\u2019s own Codex CLI harness, a 7.2-point gap attributable to the harness alone. When comparing Terminal-Bench figures across sources, always check which execution environment was used. One final cross-benchmark caveat: agent scaffolding matters as much as the underlying model. In a February 2026 evaluation of 731 problems, three different agent frameworks running the same Opus 4.5 model finished 17 resolved issues apart, a 2.3-point gap (17 of 731) that changes relative rankings. A benchmark score labeled with a model name reflects the model and the specific scaffold wrapped around it, not the model in isolation. 10 AI Agents for Software Development. A Note on Claude Mythos Preview: The current leader on SWE-bench Verified among third-party trackers is Claude Mythos Preview at 93.9%, announced April 7, 2026, under Anthropic\u2019s Project Glasswing. It is not generally available. Access is restricted to a limited set of platform partners; Anthropic has stated it does not plan broad release in the near term, in part due to elevated cybersecurity capability concerns. It sits outside the main comparison below because developers cannot access it through standard channels. Its existence does, however, signal that the practical capability ceiling sits substantially above what any publicly available tool currently delivers. #1. 
Claude Code (Anthropic) SWE-bench Verified (self-reported): 87.6% (Opus 4.7) \/ 80.8% (Opus 4.6) SWE-bench Pro (Anthropic internal variant): 64.3% (Opus 4.7, #1) \/ 53.4% (Opus 4.6) Terminal-Bench 2.0: 69.4% (Opus 4.7, Anthropic-reported) CursorBench: 70% (Opus 4.7, Cursor-reported) Claude Code subscription: $20\u2013$200\/month | Opus 4.7 API: $5\/$25 per million tokens Claude Code is Anthropic\u2019s terminal-native coding agent and the leader on code quality metrics across<\/p>","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-90574","post","type-post","status-publish","format-standard","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Best AI Agents for Software Development Ranked: A Benchmark-Driven 
Look at the Current Field - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/zh\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/\" \/>\n<meta property=\"og:locale\" content=\"zh_CN\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Best AI Agents for Software Development Ranked: A Benchmark-Driven Look at the Current Field - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/zh\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-05-15T16:34:00+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/26a0.png\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\u4f5c\u8005\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"\u9884\u8ba1\u9605\u8bfb\u65f6\u95f4\" \/>\n\t<meta name=\"twitter:data2\" content=\"26 \u5206\" \/>\n<script type=\"application\/ld+json\" 
class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"Best AI Agents for Software Development Ranked: A Benchmark-Driven Look at the Current Field\",\"datePublished\":\"2026-05-15T16:34:00+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/\"},\"wordCount\":5209,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/26a0.png\",\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"zh-Hans\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/\",\"url\":\"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/\",\"name\":\"Best AI Agents for Software Development Ranked: A Benchmark-Driven Look at the Current Field - 
YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/26a0.png\",\"datePublished\":\"2026-05-15T16:34:00+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/#breadcrumb\"},\"inLanguage\":\"zh-Hans\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"zh-Hans\",\"@id\":\"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/#primaryimage\",\"url\":\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/26a0.png\",\"contentUrl\":\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/26a0.png\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Best AI Agents for Software Development Ranked: A Benchmark-Driven Look at the Current 
Field\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"zh-Hans\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"zh-Hans\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"zh-Hans\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/zh\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Best AI Agents for Software Development Ranked: A Benchmark-Driven Look at the Current Field - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/zh\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/","og_locale":"zh_CN","og_type":"article","og_title":"Best AI Agents for Software Development Ranked: A Benchmark-Driven Look at the Current Field - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/zh\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2026-05-15T16:34:00+00:00","og_image":[{"url":"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/26a0.png","type":"","width":"","height":""}],"author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"\u4f5c\u8005":"admin NU","\u9884\u8ba1\u9605\u8bfb\u65f6\u95f4":"26 \u5206"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"Best AI Agents for Software Development Ranked: A Benchmark-Driven Look at the Current 
Field","datePublished":"2026-05-15T16:34:00+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/"},"wordCount":5209,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/#primaryimage"},"thumbnailUrl":"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/26a0.png","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"zh-Hans","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/","url":"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/","name":"Best AI Agents for Software Development Ranked: A Benchmark-Driven Look at the Current Field - 
YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"primaryImageOfPage":{"@id":"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/#primaryimage"},"image":{"@id":"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/#primaryimage"},"thumbnailUrl":"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/26a0.png","datePublished":"2026-05-15T16:34:00+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/#breadcrumb"},"inLanguage":"zh-Hans","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/"]}]},{"@type":"ImageObject","inLanguage":"zh-Hans","@id":"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/#primaryimage","url":"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/26a0.png","contentUrl":"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/26a0.png"},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"Best AI Agents for Software Development Ranked: A Benchmark-Driven Look at the Current 
Field"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"zh-Hans"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"zh-Hans","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"zh-Hans","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin NU"},"url":"https:\/\/youzum.net\/zh\/members\/adminnu\/"}]}},"rttpg_featured_image_url":null,"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/zh\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/zh\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/zh\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a 
href=\"https:\/\/youzum.net\/zh\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/zh\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"The AI coding agent market looks almost unrecognizable compared to 2024 or even early 2025. What started as inline autocomplete has evolved into fully autonomous systems that read GitHub issues, navigate multi-file codebases, write fixes, execute tests, and open pull requests \u2014 without a human typing a single line of code. By early 2026, roughly&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/posts\/90574","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/comments?post=90574"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/posts\/90574\/revisions"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/media?parent=90574"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/categories?post=90574"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/tags?post=90574"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}