{"id":31372,"date":"2025-08-13T06:02:47","date_gmt":"2025-08-13T06:02:47","guid":{"rendered":"https:\/\/youzum.net\/nebius-ai-advances-open-weight-llms-through-reinforcement-learning-for-capable-swe-agents\/"},"modified":"2025-08-13T06:02:47","modified_gmt":"2025-08-13T06:02:47","slug":"nebius-ai-advances-open-weight-llms-through-reinforcement-learning-for-capable-swe-agents","status":"publish","type":"post","link":"https:\/\/youzum.net\/fr\/nebius-ai-advances-open-weight-llms-through-reinforcement-learning-for-capable-swe-agents\/","title":{"rendered":"Nebius AI Advances Open-Weight LLMs Through Reinforcement Learning for Capable SWE Agents"},"content":{"rendered":"<p>The landscape of software engineering automation is evolving rapidly, driven by advances in Large Language Models (LLMs). However, most approaches to training capable agents rely on proprietary models or costly teacher-based methods, leaving open-weight LLMs with limited capabilities in real-world scenarios. A team of researchers from Nebius AI and Humanoid introduced a reinforcement learning framework for training long-context, multi-turn software engineering agents using a modified Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm. The research describes a technical advance in applying <strong>reinforcement learning (RL)<\/strong> to open-weight LLMs for genuine, multi-turn software engineering tasks\u2014moving beyond the single-turn, bandit-style settings that dominate RL for LLMs today.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Beyond Single-Turn RL<\/strong><\/h3>\n<p>Most RL methods for LLMs optimize for tasks such as mathematical reasoning or one-shot code generation, where agent actions are rewarded only at the conclusion and environments do not provide intermediate feedback. 
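<\/p>
<p>As a rough illustration of this contrast, the sketch below shows a multi-turn episode in which the agent receives intermediate observations at every step but the reward arrives only at the end. All names and interfaces here are invented for illustration; this is not the paper's training code.<\/p>

```python
# Toy sketch of a multi-turn SWE episode with a sparse, terminal-only reward.
# ToyEnv stands in for a sandboxed repository environment; everything here is
# illustrative, not the actual training stack.

class ToyEnv:
    def __init__(self, min_useful_steps=3):
        self.min_useful_steps = min_useful_steps
        self.steps_taken = 0

    def reset(self):
        self.steps_taken = 0
        return 'issue: test_parser fails on empty input'  # GitHub-style issue prompt

    def step(self, action):
        self.steps_taken += 1
        done = action == 'submit'
        # Intermediate feedback (test logs, compiler errors) guides the next action.
        obs = 'episode finished' if done else 'pytest: 2 failed, 41 passed'
        return obs, done

    def final_reward(self):
        # Sparse signal: success is known only after the episode ends.
        return 1.0 if self.steps_taken > self.min_useful_steps else 0.0


def rollout(policy, env, max_steps=10):
    trajectory, obs = [], env.reset()
    for _ in range(max_steps):
        action = policy(obs)
        obs, done = env.step(action)
        trajectory.append((action, obs))
        if done:
            break
    return trajectory, env.final_reward()  # reward only at the very end


script = iter(['edit', 'edit', 'edit', 'submit'])  # trivial scripted policy
traj, reward = rollout(lambda obs: next(script), ToyEnv())
```

<p>Every step before <code>submit<\/code> goes unrewarded, which is the credit-assignment difficulty described below.<\/p>
<p>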
However, <strong>software engineering (SWE)<\/strong> is fundamentally different: it requires agents to operate over <strong>long sequences of actions<\/strong>, interpret rich feedback (compiler errors, test logs), and maintain context over hundreds of thousands of tokens\u2014far exceeding typical single-step interaction loops.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Core Challenges in RL for SWE<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>Long-Horizon Reasoning:<\/strong> Agents must sustain logical coherence across many steps, often requiring context windows beyond 100k tokens.<\/li>\n<li><strong>Stateful Environment Feedback:<\/strong> Actions yield meaningful, non-trivial observations (e.g., shell command outputs, test suite results) that guide subsequent decisions.<\/li>\n<li><strong>Sparse\/Delayed Rewards:<\/strong> Success signals typically emerge only at the end of complex interactions, complicating credit assignment.<\/li>\n<li><strong>Evaluation Complexity:<\/strong> Measuring progress requires full trajectory unrolling and can be noisy due to test flakiness.<\/li>\n<\/ul>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"621\" data-attachment-id=\"73543\" data-permalink=\"https:\/\/www.marktechpost.com\/2025\/08\/12\/nebius-ai-advances-open-weight-llms-through-reinforcement-learning-for-capable-swe-agents\/screenshot-2025-08-12-at-9-28-42-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-12-at-9.28.42-PM-1.png\" data-orig-size=\"1966,1192\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2025-08-12 at 9.28.42\u202fPM\" 
data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-12-at-9.28.42-PM-1-300x182.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-12-at-9.28.42-PM-1-1024x621.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-12-at-9.28.42-PM-1-1024x621.png\" alt=\"\" class=\"wp-image-73543\" \/><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>The Technical Recipe: Modified DAPO and Agent Design<\/strong><\/h3>\n<p>The research team demonstrates a <strong>two-stage learning pipeline<\/strong> for training a Qwen2.5-72B-Instruct agent:<\/p>\n<h4 class=\"wp-block-heading\"><strong>1. Rejection Fine-Tuning (RFT)<\/strong><\/h4>\n<p>The journey begins with supervised fine-tuning. The agent is run across 7,249 rigorously filtered SWE tasks (from the SWE-REBENCH dataset). Successful interaction traces\u2014where the agent passes the environmental test suite\u2014are used to fine-tune the model, particularly masking invalid environment-formatting actions during training. This alone boosts baseline accuracy from 11% to 20% on the SWE-bench Verified benchmark.<\/p>\n<h4 class=\"wp-block-heading\"><strong>2. 
Reinforcement Learning Using Modified DAPO<\/strong><\/h4>\n<p>Building on Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO), the team introduces several key modifications for scalability and stability:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Asymmetric Clipping:<\/strong> Prevents collapse in policy entropy, maintaining exploration.<\/li>\n<li><strong>Dynamic Sample Filtering:<\/strong> Focuses optimization on trajectories with actual learning signal.<\/li>\n<li><strong>Length Penalties:<\/strong> Discourages excessive episode length, helping the agent avoid getting stuck in loops.<\/li>\n<li><strong>Token-Level Averaging:<\/strong> Every token in every trajectory contributes equally to the gradient, so longer trajectories influence updates in proportion to their length.<\/li>\n<\/ul>\n<p>The agent uses a ReAct-style loop that combines reasoning steps with tool usage. Its toolkit includes arbitrary shell commands, precise code edits, navigation\/search utilities, and a submit action to signal episode completion. Each interaction is grounded in a robust sandboxed environment, initialized from a real repository snapshot and driven by a GitHub-style issue prompt.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Scaling to Long Contexts and Real Benchmarks<\/strong><\/h3>\n<p>The agent is initially trained with a context length of 65k tokens (already double that of most open models), but performance stalls at 32%. A second RL phase expands the context to 131k tokens and doubles the episode length ceiling, focusing subsequent training on only the most beneficial tasks from the pool. 
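<\/p>
<p>The two training phases can be summarized in a small configuration sketch. The field names below are hypothetical; the token and episode figures follow the description above.<\/p>

```python
# Illustrative summary of the two-phase RL schedule described above.
# Field names are invented for this sketch; numbers follow the text.
RL_PHASES = [
    {
        'phase': 1,
        'context_tokens': 65_000,         # ~2x the context of most open models
        'episode_ceiling_multiplier': 1,
        'task_pool': 'full filtered pool',
    },
    {
        'phase': 2,
        'context_tokens': 131_000,        # expanded context window
        'episode_ceiling_multiplier': 2,  # episode length ceiling doubled
        'task_pool': 'most beneficial tasks only',
    },
]
```

<p>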
This enables scaling to longer stack traces and diff histories inherent to real-world debugging and patching tasks.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Results: Closing the Gap with Baselines<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li>The final RL-trained agent attains <strong>39% Pass@1<\/strong> accuracy on the SWE-bench Verified benchmark, <strong>doubling<\/strong> the rejection fine-tuned baseline, and matching the performance of cutting-edge open-weight models such as DeepSeek-V3-0324, all without teacher-based supervision.<\/li>\n<li>On held-out SWE-rebench splits, scores remain competitive (35% for May, 31.7% for June), indicating the method\u2019s robustness.<\/li>\n<li>When compared head-to-head with top open baselines and specialized SWE agents, the RL agent matches or outperforms several models, confirming the effectiveness of the RL methodology in this domain.<\/li>\n<\/ul>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th><\/th>\n<th>Pass@1 SWE-bench Verified<\/th>\n<th>Pass@10<\/th>\n<th>Pass@1 SWE-rebench May<\/th>\n<th>Pass@10<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Qwen2.5-72B-Instruct (RL, final)<\/td>\n<td>39.04%<\/td>\n<td>58.4%<\/td>\n<td>35.0%<\/td>\n<td>52.5%<\/td>\n<\/tr>\n<tr>\n<td>DeepSeek-V3-0324<\/td>\n<td>39.56%<\/td>\n<td>62.2%<\/td>\n<td>36.75%<\/td>\n<td>60.0%<\/td>\n<\/tr>\n<tr>\n<td>Qwen3-235B no-thinking<\/td>\n<td>25.84%<\/td>\n<td>54.4%<\/td>\n<td>27.25%<\/td>\n<td>57.5%<\/td>\n<\/tr>\n<tr>\n<td>Llama4 Maverick<\/td>\n<td>15.84%<\/td>\n<td>47.2%<\/td>\n<td>19.0%<\/td>\n<td>50.0%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><em>Pass@1 scores are averaged over 10 runs and reported as mean \u00b1 standard error.<\/em><\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img decoding=\"async\" width=\"1024\" height=\"671\" data-attachment-id=\"73541\" 
data-permalink=\"https:\/\/www.marktechpost.com\/2025\/08\/12\/nebius-ai-advances-open-weight-llms-through-reinforcement-learning-for-capable-swe-agents\/screenshot-2025-08-12-at-9-27-54-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-12-at-9.27.54-PM-1.png\" data-orig-size=\"1976,1294\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2025-08-12 at 9.27.54\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-12-at-9.27.54-PM-1-300x196.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-12-at-9.27.54-PM-1-1024x671.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-12-at-9.27.54-PM-1-1024x671.png\" alt=\"\" class=\"wp-image-73541\" \/><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Key Insights<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>Credit Assignment:<\/strong> RL in this sparse-reward regime remains fundamentally challenging. The paper suggests future work with reward shaping, step-level critics, or prefix-based rollouts for more granular feedback.<\/li>\n<li><strong>Uncertainty Estimation:<\/strong> Real-world agents need to know when to abstain or express confidence. 
Techniques like output entropy or explicit confidence scoring are next steps.<\/li>\n<li><strong>Infrastructure:<\/strong> Training utilized context parallelism (splitting long sequences over GPUs) on 16 H200 nodes, with distributed orchestration via Kubernetes and Tracto AI, and vLLM for fast inference.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\"><strong>Conclusion<\/strong><\/h3>\n<p>This research validates RL as a potent paradigm for building autonomous software engineers using open-weight LLMs. By conquering long-horizon, multi-turn, real-environment tasks, the methodology paves the way for scalable, teacher-free agent development\u2014directly leveraging the power of interaction rather than static instruction. With further refinements, such RL pipelines promise efficient, reliable, and versatile automation for the future of software engineering.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the <strong><a href=\"https:\/\/arxiv.org\/abs\/2508.03501\" target=\"_blank\" rel=\"noreferrer noopener\">Paper here<\/a><\/strong>. 
<\/p>\n<p>The post <a 
href=\"https:\/\/www.marktechpost.com\/2025\/08\/12\/nebius-ai-advances-open-weight-llms-through-reinforcement-learning-for-capable-swe-agents\/\">Nebius AI Advances Open-Weight LLMs Through Reinforcement Learning for Capable SWE Agents<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>The landscape of software engineering automation is evolving rapidly, driven by advances in Large Language Models (LLMs). However, most approaches to training capable agents rely on proprietary models or costly teacher-based methods, leaving open-weight LLMs with limited capabilities in real-world scenarios. A team of researchers from Nebius AI and Humanoid introduced a reinforcement learning framework for training long-context, multi-turn software engineering agents using a modified Decoupled Advantage Policy Optimization (DAPO) algorithm. The research explains a technical breakthrough in applying reinforcement learning (RL) to open-source LLMs for genuine, multi-turn software engineering tasks\u2014moving beyond the single-turn, bandit-style settings that dominate RL for LLMs today. Beyond Single-Turn Reinforcement Learning RL Most RL methods for LLMs optimize for tasks such as mathematical reasoning or one-shot code generation, where agent actions are rewarded only at the conclusion and environments do not provide intermediate feedback. However, software engineering (SWE) is fundamentally different: it requires agents to operate over long sequences of actions, interpret rich feedback (compiler errors, test logs), and maintain context over hundreds of thousands of tokens\u2014far exceeding typical single-step interaction loops. Core Challenges in RL for SWE Long-Horizon Reasoning: Agents must sustain logical coherence across many steps, often requiring context windows beyond 100k tokens. 
Stateful Environment Feedback: Actions yield meaningful, non-trivial observations (e.g., shell command outputs, test suite results) that guide subsequent decisions. Sparse\/Delayed Rewards: Success signals typically emerge only at the end of complex interactions, complicating credit assignment. Evaluation Complexity: Measuring progress requires full trajectory unrolling and can be noisy due to test flakiness. The Technical Recipe: Modified DAPO and Agent Design The research team demonstrates a two-stage learning pipeline for training a Qwen2.5-72B-Instruct agent: 1. Rejection Fine-Tuning (RFT) The journey begins with supervised fine-tuning. The agent is run across 7,249 rigorously filtered SWE tasks (from the SWE-REBENCH dataset). Successful interaction traces\u2014where the agent passes the environmental test suite\u2014are used to fine-tune the model, particularly masking invalid environment-formatting actions during training. This alone boosts baseline accuracy from 11% to 20% on the SWE-bench Verified benchmark. 2. Reinforcement Learning Using Modified DAPO Building on Decoupled Advantage Policy Optimization (DAPO), several key modifications are introduced for scalability and stability: Asymmetric Clipping: Prevents collapse in policy entropy, maintaining exploration. Dynamic Sample Filtering: Focuses optimization on trajectories with actual learning signal. Length Penalties: Discourages excessive episode length, helping the agent avoid getting stuck in loops. Token-Level Averaging: Every token in every trajectory contributes equally to the gradient, empowering longer trajectories to influence updates. The agent utilizes a ReAct-style loop, which lets it combine reasoning steps with tool usage. Its supported toolkit includes arbitrary shell commands, precise code edits, navigation\/search utilities, and a submit action to signal episode completion. 
Each interaction is grounded in a robust sandboxed environment, initialized from real repository snapshots and backed by a GitHub-style issue prompt. Scaling to Long Contexts and Real Benchmarks Initially trained with a context length of 65k tokens (already double that of most open models), performance stalls at 32%. A second RL phase expands the context to 131k tokens and doubles the episode length ceiling, focusing subsequent training on only the most beneficial tasks from the pool. This enables scaling to longer stack traces and diff histories inherent to real-world debugging and patching tasks. Results: Closing the Gap with Baselines The final RL-trained agent attains 39% Pass@1 accuracy on the SWE-bench Verified benchmark, doubling the rejection fine-tuned baseline, and matching the performance of cutting-edge open-weight models such as DeepSeek-V3-0324, all without teacher-based supervision. On held-out SWE-rebench splits, scores remain competitive (35% for May, 31.7% for June), indicating the method\u2019s robustness. When compared head-to-head with top open baselines and specialized SWE agents, the RL agent matches or outperforms several models, confirming the effectiveness of the RL methodology in this domain. Pass@1 SWE-bench Verified Pass@10 Pass@1 SWE-rebench May Pass@10 Qwen2.5-72B-Instruct (RL, final) 39.04% 58.4% 35.0% 52.5% DeepSeek-V3-0324 39.56% 62.2% 36.75% 60.0% Qwen3-235B no-thinking 25.84% 54.4% 27.25% 57.5% Llama4 Maverick 15.84% 47.2% 19.0% 50.0% Pass@1 scores are averaged over 10 runs and reported as mean \u00b1 standard error. Key Insights Credit Assignment: RL in this sparse-reward regime remains fundamentally challenging. The paper suggests future work with reward shaping, step-level critics, or prefix-based rollouts for more granular feedback. Uncertainty Estimation: Real-world agents need to know when to abstain or express confidence. Techniques like output entropy or explicit confidence scoring are next steps. 
Infrastructure: Training utilized context parallelism (splitting long sequences over GPUs) on 16 H200 nodes, with distributed orchestration via Kubernetes and Tracto AI, and vLLM for fast inference. Conclusion This research validates RL as a potent paradigm for building autonomous software engineers using open-weight LLMs. By conquering long-horizon, multi-turn, real-environment tasks, the methodology paves the way for scalable, teacher-free agent development\u2014directly leveraging the power of interaction rather than static instruction. With further refinements, such RL pipelines promise efficient, reliable, and versatile automation for the future of software engineering. Check out the Paper here. Feel free to check out our\u00a0GitHub Page for Tutorials, Codes and Notebooks.\u00a0Also,\u00a0feel free to follow us on\u00a0Twitter\u00a0and don\u2019t forget to join our\u00a0100k+ ML SubReddit\u00a0and Subscribe to\u00a0our Newsletter. Star us on GitHub Sponsor us The post Nebius AI Advances Open-Weight LLMs Through Reinforcement Learning for Capable SWE Agents appeared first on 
MarkTechPost.<\/p>","protected":false},"author":2,"featured_media":31373,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-31372","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Nebius AI Advances Open-Weight LLMs Through Reinforcement Learning for Capable SWE Agents - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, 
max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/fr\/nebius-ai-advances-open-weight-llms-through-reinforcement-learning-for-capable-swe-agents\/\" \/>\n<meta property=\"og:locale\" content=\"fr_FR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Nebius AI Advances Open-Weight LLMs Through Reinforcement Learning for Capable SWE Agents - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/fr\/nebius-ai-advances-open-weight-llms-through-reinforcement-learning-for-capable-swe-agents\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2025-08-13T06:02:47+00:00\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\u00c9crit par\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"Dur\u00e9e de lecture estim\u00e9e\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/nebius-ai-advances-open-weight-llms-through-reinforcement-learning-for-capable-swe-agents\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/nebius-ai-advances-open-weight-llms-through-reinforcement-learning-for-capable-swe-agents\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"Nebius AI Advances Open-Weight 
LLMs Through Reinforcement Learning for Capable SWE Agents\",\"datePublished\":\"2025-08-13T06:02:47+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/nebius-ai-advances-open-weight-llms-through-reinforcement-learning-for-capable-swe-agents\/\"},\"wordCount\":871,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/nebius-ai-advances-open-weight-llms-through-reinforcement-learning-for-capable-swe-agents\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-12-at-9.28.42-PM-1-1024x621-bRIzi4.webp\",\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"fr-FR\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/nebius-ai-advances-open-weight-llms-through-reinforcement-learning-for-capable-swe-agents\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/nebius-ai-advances-open-weight-llms-through-reinforcement-learning-for-capable-swe-agents\/\",\"url\":\"https:\/\/youzum.net\/nebius-ai-advances-open-weight-llms-through-reinforcement-learning-for-capable-swe-agents\/\",\"name\":\"Nebius AI Advances Open-Weight LLMs Through Reinforcement Learning for Capable SWE Agents - 
YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/nebius-ai-advances-open-weight-llms-through-reinforcement-learning-for-capable-swe-agents\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/nebius-ai-advances-open-weight-llms-through-reinforcement-learning-for-capable-swe-agents\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-12-at-9.28.42-PM-1-1024x621-bRIzi4.webp\",\"datePublished\":\"2025-08-13T06:02:47+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/nebius-ai-advances-open-weight-llms-through-reinforcement-learning-for-capable-swe-agents\/#breadcrumb\"},\"inLanguage\":\"fr-FR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/nebius-ai-advances-open-weight-llms-through-reinforcement-learning-for-capable-swe-agents\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"fr-FR\",\"@id\":\"https:\/\/youzum.net\/nebius-ai-advances-open-weight-llms-through-reinforcement-learning-for-capable-swe-agents\/#primaryimage\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-12-at-9.28.42-PM-1-1024x621-bRIzi4.webp\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-12-at-9.28.42-PM-1-1024x621-bRIzi4.webp\",\"width\":1024,\"height\":621},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/nebius-ai-advances-open-weight-llms-through-reinforcement-learning-for-capable-swe-agents\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Nebius AI Advances Open-Weight LLMs Through Reinforcement Learning for Capable SWE 