{"id":93275,"date":"2026-05-27T17:14:04","date_gmt":"2026-05-27T17:14:04","guid":{"rendered":"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/"},"modified":"2026-05-27T17:14:04","modified_gmt":"2026-05-27T17:14:04","slug":"nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code","status":"publish","type":"post","link":"https:\/\/youzum.net\/de\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/","title":{"rendered":"NVIDIA Releases Polar, a Token-Faithful Rollout Framework for GRPO Training Across Codex, Claude Code, and Qwen Code"},"content":{"rendered":"<p class=\"wp-block-paragraph\">Reinforcement learning for language agents is growing more complex. Agents now manage multi-turn tool use, long-running contexts, and multi-agent orchestration. The main engineering challenge is connecting existing agent software to training pipelines without breaking how those tools work.<\/p>\n<p class=\"wp-block-paragraph\">NVIDIA\u2019s research team introduced <strong><a href=\"https:\/\/arxiv.org\/pdf\/2605.24220\" target=\"_blank\" rel=\"noreferrer noopener\">Polar<\/a><\/strong>, a rollout framework that lets researchers run reinforcement learning over any agent harness without modifying that harness.<\/p>\n<h2 class=\"wp-block-heading\"><strong>The Core Problem Polar Solves<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">An \u2018agent harness\u2019 is a tool like Codex CLI, Claude Code, Qwen Code, or Pi. These harnesses manage system prompts, tool formatting, context engineering, and how the agent submits patches. 
These details directly affect agent behavior at evaluation time.<\/p>\n<p class=\"wp-block-paragraph\">Traditional RL infrastructure requires harness logic to be rewritten behind a framework-owned environment API \u2014 typically <code>env.init()<\/code>, <code>env.step()<\/code>, <code>env.reset()<\/code> in the OpenAI Gym style. Every new harness requires new integration code. That integration can also lose execution details specific to the native harness path.<\/p>\n<p class=\"wp-block-paragraph\">Polar\u2019s key observation is that every LLM-based agent must call a model. That model API boundary is a common interface outside the agent itself. Instead of integrating inside the harness, Polar places a proxy at that boundary.<\/p>\n<h2 class=\"wp-block-heading\"><strong>How the Proxy Works<\/strong><\/h2>\n<p class=\"wp-block-paragraph\"><strong>For each incoming model request, the gateway proxy performs four steps:<\/strong><\/p>\n<ol class=\"wp-block-list\">\n<li><strong>Detect the provider API<\/strong> \u2014 using the request path and headers, it distinguishes Anthropic Messages, OpenAI Chat Completions, OpenAI Responses, and Google generateContent-style calls.<\/li>\n<li><strong>Normalize the request<\/strong> \u2014 converts roles, content parts, tool definitions, and generation parameters into the OpenAI Chat Completions shape used by the local inference server.<\/li>\n<li><strong>Capture token-level data<\/strong> \u2014 stores request messages, response messages, prompt token IDs, sampled response token IDs, finish reason, and log probabilities.<\/li>\n<li><strong>Return the provider shape<\/strong> \u2014 transforms the response back into the schema the harness expects.<\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\">For streaming requests, Polar obtains a non-streaming upstream response and emits a synthetic provider-shaped stream. 
This preserves compatibility with harnesses that expect server-sent events while ensuring complete token capture.<\/p>\n<p class=\"wp-block-paragraph\">The only required change to an existing harness is pointing its model base URL at the gateway.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1334\" height=\"872\" data-attachment-id=\"80140\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/27\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/screenshot-2026-05-27-at-10-08-51-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-10.08.51-AM-1.png\" data-orig-size=\"1334,872\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\",\"alt\":\"\"}' data-image-title=\"Screenshot 2026-05-27 at 10.08.51\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-10.08.51-AM-1-1024x669.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-10.08.51-AM-1.png\" alt=\"\" class=\"wp-image-80140\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2605.24220<\/figcaption><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Architecture: Rollout Server and Gateway Nodes<\/strong><\/h2>\n<p class=\"wp-block-paragraph\"><strong>Polar has two core components<\/strong>:<\/p>\n<p class=\"wp-block-paragraph\">The <strong>rollout server<\/strong> accepts a <code>TaskRequest<\/code> and expands it into <code>num_samples<\/code> independent sessions. 
Each session carries a session ID, task ID, timeout budget, runtime specification, agent specification, trajectory builder, evaluator, and callback URL. The server dispatches sessions to gateway nodes and accepts callbacks when sessions complete.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Gateway nodes<\/strong> own the lifecycle of each session: starting the runtime, running the harness, building trajectories, evaluating output, and tearing the runtime down. The gateway also hosts the proxy endpoint for that session\u2019s model calls, keeping completion capture tied to the session registry.<\/p>\n<p class=\"wp-block-paragraph\">Within each gateway, isolated worker pools handle the INIT, RUNNING, and POSTRUN stages. A bounded READY buffer holds initialized runtimes until a run slot is available. CPU-heavy runtime preparation and evaluator prewarming proceed off the critical path, so they do not block active GPU-bound agent execution. If a harness times out after model calls have been captured, the gateway still enters POSTRUN so partial traces can be recovered.<\/p>\n<p class=\"wp-block-paragraph\">Built-in evaluators include a session-completion reward, a configurable test-on-output evaluator, and a SWE-Bench\/SWE-Gym harness evaluator. Custom evaluators can be added through a registry interface.<\/p>\n<p class=\"wp-block-paragraph\">Polar currently supports Docker and rootless Apptainer runtimes. Built-in harness shortcuts include <code>codex<\/code>, <code>claude_code<\/code>, <code>gemini_cli<\/code>, <code>qwen_code<\/code>, <code>opencode<\/code>, and <code>pi<\/code>.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Trajectory Reconstruction: Per Request vs. Prefix Merging<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">After a session completes, Polar reconstructs trainable trajectories from captured model calls. 
<\/p>\n<p class=\"wp-block-paragraph\"><strong>Two strategies are available:<\/strong><\/p>\n<p class=\"wp-block-paragraph\">The <strong><code>per_request<\/code><\/strong> builder treats every model call as one independent trace. It is lossless per individual call but fragments multi-turn sessions. A single coding problem can produce hundreds of per-request traces, increasing the burden on downstream trainers.<\/p>\n<p class=\"wp-block-paragraph\">The <strong><code>prefix_merging<\/code><\/strong> builder reconstructs longer traces where the harness session preserves append-only conversation histories. It partitions completions into ordered chains by verifying a strict token-prefix relation between adjacent completions. Sub-agents, context compaction boundaries, and parallel agent branches naturally form separate chains. Within each merged trace, only sampled assistant tokens are marked trainable. Canonical interstitial tokens receive a loss mask of zero.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Ablation Results<\/strong><\/h3>\n<p class=\"wp-block-paragraph\">The research team benchmarks both strategies on the same model, hardware, and topology over three training steps.<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Metric<\/th>\n<th><code>per_request<\/code><\/th>\n<th><code>prefix_merging<\/code><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Trainer updates<\/td>\n<td>1,185<\/td>\n<td>218<\/td>\n<\/tr>\n<tr>\n<td>Wall-clock time<\/td>\n<td>189.5 min<\/td>\n<td>35.2 min<\/td>\n<\/tr>\n<tr>\n<td>Speedup<\/td>\n<td>\u2014<\/td>\n<td><strong>5.39\u00d7<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Avg. 
rollout GPU utilization<\/td>\n<td>20.4%<\/td>\n<td>87.7%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<h2 class=\"wp-block-heading\"><strong>SWE-Bench Verified Results<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Training uses standard GRPO with the Slime trainer on the Qwen3.5-4B base model. The dataset is SkyRL-v0-293-data SWE-Gym (293 tasks), run for 1 epoch with a rollout batch size of 4 and 16 samples per prompt. All experiments use <code>prefix_merging<\/code> for trajectory construction.<\/p>\n<h4 class=\"wp-block-heading\"><strong>Training Rollout Reward Progress (pass@1)<\/strong><\/h4>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Harness<\/th>\n<th>First 10 Steps<\/th>\n<th>Last 10 Steps<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Codex<\/td>\n<td>9.5%<\/td>\n<td>54.5%<\/td>\n<\/tr>\n<tr>\n<td>Claude Code<\/td>\n<td>28.8%<\/td>\n<td>67.0%<\/td>\n<\/tr>\n<tr>\n<td>Qwen Code<\/td>\n<td>61.6%<\/td>\n<td>66.0%<\/td>\n<\/tr>\n<tr>\n<td>Pi<\/td>\n<td>61.6%<\/td>\n<td>76.2%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<h4 class=\"wp-block-heading\"><strong>SWE-Bench Verified Final Scores<\/strong><\/h4>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Harness<\/th>\n<th>Base<\/th>\n<th>Polar RL<\/th>\n<th>Gain<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Codex<\/td>\n<td>3.8%<\/td>\n<td>26.4%<\/td>\n<td>+22.6 pts<\/td>\n<\/tr>\n<tr>\n<td>Claude Code<\/td>\n<td>29.8%<\/td>\n<td>34.6%<\/td>\n<td>+4.8 pts<\/td>\n<\/tr>\n<tr>\n<td>Qwen Code<\/td>\n<td>34.6%<\/td>\n<td>35.2%<\/td>\n<td>+0.6 pts<\/td>\n<\/tr>\n<tr>\n<td>Pi<\/td>\n<td>34.2%<\/td>\n<td>40.4%<\/td>\n<td>+6.2 pts<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\">The largest gain is under Codex, which presents an unfamiliar action protocol and patch-submission style to a Qwen model not originally trained on that harness. 
Polar attaches the reward signal to the actual sampled tokens flowing through the Codex execution path, so GRPO optimizes the behavior the model uses at evaluation time. Under the native Qwen Code harness, where the base model is already well-aligned, Polar still delivers a 0.6 point gain.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Offline SFT Data Generation<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Polar can also serve as a distributed offline data generation service with no changes to the runtime. The research team demonstrates this using Qwen3.5-122B-A10B on an 8\u00d7H100 server (TP=8, max_model_len=32,768) with the pi harness against 1,638 instances from seven SWE-Gym repositories.<\/p>\n<p class=\"wp-block-paragraph\">A trajectory is accepted into the SFT corpus only if the SWE-Bench evaluation harness confirms the agent\u2019s patch resolves every <code>FAIL_TO_PASS<\/code> test and leaves every <code>PASS_TO_PASS<\/code> test green.<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Repository<\/th>\n<th>Attempts<\/th>\n<th>Accepted<\/th>\n<th>Rate<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>getmoto\/moto<\/td>\n<td>343<\/td>\n<td>184<\/td>\n<td>53.6%<\/td>\n<\/tr>\n<tr>\n<td>python\/mypy<\/td>\n<td>257<\/td>\n<td>101<\/td>\n<td>39.3%<\/td>\n<\/tr>\n<tr>\n<td>conan-io\/conan<\/td>\n<td>71<\/td>\n<td>27<\/td>\n<td>38.0%<\/td>\n<\/tr>\n<tr>\n<td>pydantic\/pydantic<\/td>\n<td>81<\/td>\n<td>24<\/td>\n<td>29.6%<\/td>\n<\/tr>\n<tr>\n<td>iterative\/dvc<\/td>\n<td>219<\/td>\n<td>45<\/td>\n<td>20.5%<\/td>\n<\/tr>\n<tr>\n<td>pandas-dev\/pandas<\/td>\n<td>477<\/td>\n<td>98<\/td>\n<td>19.7%<\/td>\n<\/tr>\n<tr>\n<td>dask\/dask<\/td>\n<td>141<\/td>\n<td>25<\/td>\n<td>17.7%<\/td>\n<\/tr>\n<tr>\n<td><strong>Total<\/strong><\/td>\n<td><strong>1,638<\/strong><\/td>\n<td><strong>504<\/strong><\/td>\n<td><strong>30.8%<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\">The run cost 
roughly 64 GPU-hours. Accepted trajectories average 104 messages per session and 51 assistant turns. <\/p>\n<h2 class=\"wp-block-heading\"><strong>Framework Comparison<\/strong><\/h2>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>System<\/th>\n<th>Async RL<\/th>\n<th>Async Rollout Staging<\/th>\n<th>Rollout as Service<\/th>\n<th>Harness Agnostic<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Polar<\/strong><\/td>\n<td>\u2713<\/td>\n<td>\u2713<\/td>\n<td>\u2713<\/td>\n<td>\u2713<\/td>\n<\/tr>\n<tr>\n<td>ProRL Agent<\/td>\n<td>\u2713<\/td>\n<td>\u2713<\/td>\n<td>\u2713<\/td>\n<td>\u2717<\/td>\n<\/tr>\n<tr>\n<td>SkyRL-Agent<\/td>\n<td>\u2713<\/td>\n<td>\u2713<\/td>\n<td>\u2717<\/td>\n<td>partial<\/td>\n<\/tr>\n<tr>\n<td>PRIME-RL<\/td>\n<td>\u2713<\/td>\n<td>\u2717<\/td>\n<td>\u2717<\/td>\n<td>\u2717<\/td>\n<\/tr>\n<tr>\n<td>Agent Lightning<\/td>\n<td>partial<\/td>\n<td>\u2717<\/td>\n<td>partial<\/td>\n<td>partial<\/td>\n<\/tr>\n<tr>\n<td>rLLM<\/td>\n<td>partial<\/td>\n<td>\u2717<\/td>\n<td>\u2717<\/td>\n<td>\u2717<\/td>\n<\/tr>\n<tr>\n<td>OpenClaw-RL<\/td>\n<td>\u2713<\/td>\n<td>\u2717<\/td>\n<td>\u2717<\/td>\n<td>partial<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\">Polar is the only system in this comparison with first-class support across all four properties.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Strengths and Limitations<\/strong><\/h2>\n<h4 class=\"wp-block-heading\"><strong>Strengths<\/strong><\/h4>\n<ul class=\"wp-block-list\">\n<li>No harness code changes required \u2014 the proxy intercepts at the model API boundary<\/li>\n<li>Provider-agnostic: supports Anthropic, OpenAI Chat, OpenAI Responses, and Google API formats natively<\/li>\n<li><code>prefix_merging<\/code> reduces trainer updates from 1,185 to 218 and cuts wall-clock time 5.39\u00d7<\/li>\n<li>Works for both online RL and offline SFT data generation with the same runtime<\/li>\n<li>Harness-native RL delivers 
large gains for unfamiliar execution paths \u2014 22.6 pts on Codex<\/li>\n<li>Partial traces are recovered when a harness times out mid-session<\/li>\n<li>Released as open source under NeMo Gym<\/li>\n<\/ul>\n<h4 class=\"wp-block-heading\"><strong>Limitations<\/strong><\/h4>\n<ul class=\"wp-block-list\">\n<li>Reward design, evaluator quality, and distribution shift remain the researcher\u2019s responsibility<\/li>\n<li>Requires the harness to support a configurable model base URL<\/li>\n<li>Token-level capture depends on the serving stack supplying reliable token IDs and log probabilities<\/li>\n<li><code>per_request<\/code> strategy produced reward hacking in experiments due to noisy credit assignment at the session level; session normalization and PRM-style credit assignment are on the roadmap<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>Polar trains LLM agents via a model API proxy \u2014 no harness code changes required<\/li>\n<li>Supports Anthropic Messages, OpenAI Chat Completions, OpenAI Responses, and Google generateContent APIs<\/li>\n<li>Using GRPO on Qwen3.5-4B, Polar improves SWE-Bench Verified by up to 22.6 points across four coding harnesses<\/li>\n<li><code>prefix_merging<\/code> trajectory reconstruction delivers a 5.39\u00d7 wall-clock speedup over <code>per_request<\/code><\/li>\n<li>Generated 504 accepted SFT trajectories from 1,638 attempts (30.8%) at ~64 GPU-hours, with a 90\/10 train\/test split by repository; released under Apache-2.0<\/li>\n<li>Rewrites ProRL Agent; registered as a NeMo Gym environment<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p class=\"wp-block-paragraph\">Check out the <strong><a href=\"https:\/\/arxiv.org\/pdf\/2605.24220\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a><\/strong> and <strong><a href=\"https:\/\/github.com\/NVIDIA-NeMo\/ProRL-Agent-Server\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub Repo<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/27\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/\">NVIDIA Releases Polar, a Token-Faithful Rollout Framework for GRPO Training Across Codex, Claude Code, and Qwen Code<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Reinforcement learning for language agents is growing more complex. Agents now manage multi-turn tool use, long-running contexts, and multi-agent orchestration. The main engineering challenge is connecting existing agent software to training pipelines without breaking how those tools work. NVIDIA\u2019s research team introduced Polar, a rollout framework that lets researchers run reinforcement learning over any agent harness without modifying that harness. The Core Problem Polar Solves An \u2018agent harness\u2019 is a tool like Codex CLI, Claude Code, Qwen Code, or Pi. These harnesses manage system prompts, tool formatting, context engineering, and how the agent submits patches. These details directly affect agent behavior at evaluation time. Traditional RL infrastructure requires harness logic to be rewritten behind a framework-owned environment API \u2014 typically env.init(), env.step(), env.reset() in the OpenAI Gym style. Every new harness requires new integration code. 
That integration can also lose execution details specific to the native harness path. Polar\u2019s key observation is that every LLM-based agent must call a model. That model API boundary is a common interface outside the agent itself. Instead of integrating inside the harness, Polar places a proxy at that boundary. How the Proxy Works For each incoming model request, the gateway proxy performs four steps: Detect the provider API \u2014 using the request path and headers, it distinguishes Anthropic Messages, OpenAI Chat Completions, OpenAI Responses, and Google generateContent-style calls. Normalize the request \u2014 converts roles, content parts, tool definitions, and generation parameters into the OpenAI Chat Completions shape used by the local inference server. Capture token-level data \u2014 stores request messages, response messages, prompt token IDs, sampled response token IDs, finish reason, and log probabilities. Return the provider shape \u2014 transforms the response back into the schema the harness expects. For streaming requests, Polar obtains a non-streaming upstream response and emits a synthetic provider-shaped stream. This preserves compatibility with harnesses that expect server-sent events while ensuring complete token capture. The only required change to an existing harness is pointing its model base URL at the gateway. https:\/\/arxiv.org\/pdf\/2605.24220 Architecture: Rollout Server and Gateway Nodes Polar has two core components: The rollout server accepts a TaskRequest and expands it into num_samples independent sessions. Each session carries a session ID, task ID, timeout budget, runtime specification, agent specification, trajectory builder, evaluator, and callback URL. The server dispatches sessions to gateway nodes and accepts callbacks when sessions complete. Gateway nodes own the lifecycle of each session \u2014 starting the runtime, running the harness, building trajectories, evaluating output, and teardown. 
The gateway also hosts the proxy endpoint for that session\u2019s model calls, keeping completion capture tied to the session registry. Within each gateway, isolated worker pools handle INIT, RUNNING, and POSTRUN stages. A bounded READY buffer holds initialized runtimes until a run slot is available. CPU-heavy runtime preparation and evaluator prewarm proceed off the critical path, without blocking active GPU-bound agent execution. If a harness times out after model calls have been captured, the gateway still enters POSTRUN so partial traces can be recovered. Built-in evaluators include a session-completion reward, a configurable test-on-output evaluator, and a SWE-Bench\/SWE-Gym harness evaluator. Custom evaluators can be added through a registry interface. Polar currently supports Docker and rootless Apptainer runtimes. Built-in harness shortcuts include codex, claude_code, gemini_cli, qwen_code, opencode, and pi. Trajectory Reconstruction: Per Request vs. Prefix Merging After a session completes, Polar reconstructs trainable trajectories from captured model calls. Two strategies are available: The per_request builder treats every model call as one independent trace. It is lossless per individual call but fragments multi-turn sessions. A single coding problem can produce hundreds of per-request traces, increasing the burden on downstream trainers. The prefix_merging builder reconstructs longer traces where the harness session preserves append-only conversation histories. It partitions completions into ordered chains by verifying a strict token-prefix relation between adjacent completions. Sub-agents, context compaction boundaries, and parallel agent branches naturally form separate chains. Within each merged trace, only sampled assistant tokens are marked trainable. Canonical interstitial tokens receive a loss mask of zero. Ablation Results The research team benchmarks both strategies on the same model, hardware, and topology over three training steps. 
Metric per_request prefix_merging Trainer updates 1,185 218 Wall-clock time 189.5 min 35.2 min Speedup \u2014 5.39\u00d7 Avg. rollout GPU utilization 20.4% 87.7% SWE-Bench Verified Results Training uses standard GRPO on the Qwen3.5-4B base model. The dataset is SkyRL-v0-293-data SWE-Gym (293 tasks, 1 epoch, rollout batch size 4, 16 samples per prompt) with the Slime trainer. All experiments use prefix_merging for trajectory construction. Training Rollout Reward Progress (pass@1) Harness First 10 Steps Last 10 Steps Codex 9.5% 54.5% Claude Code 28.8% 67.0% Qwen Code 61.6% 66.0% Pi 61.6% 76.2% SWE-Bench Verified Final Scores Harness Base Polar RL Gain Codex 3.8% 26.4% +22.6 pts Claude Code 29.8% 34.6% +4.8 pts Qwen Code 34.6% 35.2% +0.6 pts Pi 34.2% 40.4% +6.2 pts The largest gain is under Codex. Codex presents an unfamiliar action protocol and patch-submission style to a Qwen model not originally trained on that harness. Polar attaches the reward signal to the actual sampled tokens flowing through the Codex execution path, so GRPO optimizes the behavior the model uses at evaluation time. Under the native Qwen Code harness, where the base model is already well-aligned, Polar still delivers a 0.6 point gain. Offline SFT Data Generation Polar can also serve as a distributed offline data generation service with no changes to the runtime. The research team demonstrates this using Qwen3.5-122B-A10B on an 8\u00d7H100 server (TP=8, max_model_len=32,768) with the pi harness against 1,638 instances from seven SWE-Gym repositories. A trajectory is accepted into the SFT corpus only if the SWE-Bench evaluation harness confirms the agent\u2019s patch resolves every FAIL_TO_PASS test and leaves every PASS_TO_PASS test green. 
Repository Attempts Accepted Rate getmoto\/moto 343 184 53.6% python\/mypy 257 101 39.3% conan-io\/conan 71 27 38.0% pydantic\/pydantic 81 24 29.6% iterative\/dvc 219 45 20.5% pandas-dev\/pandas 477 98 19.7% dask\/dask 141 25 17.7% Total 1,638 504 30.8% The run cost roughly 64 GPU-hours. Accepted trajectories average 104 messages per session and 51 assistant turns. Framework Comparison System Async RL<\/p>","protected":false},"author":2,"featured_media":93276,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-93275","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>NVIDIA Releases Polar, a Token-Faithful 
Rollout Framework for GRPO Training Across Codex, Claude Code, and Qwen Code - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/de\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/\" \/>\n<meta property=\"og:locale\" content=\"de_DE\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"NVIDIA Releases Polar, a Token-Faithful Rollout Framework for GRPO Training Across Codex, Claude Code, and Qwen Code - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/de\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-05-27T17:14:04+00:00\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Verfasst von\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"Gesch\u00e4tzte Lesezeit\" \/>\n\t<meta name=\"twitter:data2\" content=\"10\u00a0Minuten\" \/>\n<script type=\"application\/ld+json\" 
class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"NVIDIA Releases Polar, a Token-Faithful Rollout Framework for GRPO Training Across Codex, Claude Code, and Qwen Code\",\"datePublished\":\"2026-05-27T17:14:04+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/\"},\"wordCount\":1971,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-10.08.51-AM-1-v2ssjz.png\",\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/\",\"url\":\"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/\",\"name\":\"NVIDIA Releases 
Polar, a Token-Faithful Rollout Framework for GRPO Training Across Codex, Claude Code, and Qwen Code - YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-10.08.51-AM-1-v2ssjz.png\",\"datePublished\":\"2026-05-27T17:14:04+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/#breadcrumb\"},\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/#primaryimage\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-10.08.51-AM-1-v2ssjz.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-10.08.51-AM-1-v2ssjz.png\",\"width\":1334,\"height\":872},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"
Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"NVIDIA Releases Polar, a Token-Faithful Rollout Framework for GRPO Training Across Codex, Claude Code, and Qwen Code\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"de\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/de\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO 
plugin. -->","yoast_head_json":{"title":"NVIDIA Releases Polar, a Token-Faithful Rollout Framework for GRPO Training Across Codex, Claude Code, and Qwen Code - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/de\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/","og_locale":"de_DE","og_type":"article","og_title":"NVIDIA Releases Polar, a Token-Faithful Rollout Framework for GRPO Training Across Codex, Claude Code, and Qwen Code - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/de\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2026-05-27T17:14:04+00:00","author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"Verfasst von":"admin NU","Gesch\u00e4tzte Lesezeit":"10\u00a0Minuten"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"NVIDIA Releases Polar, a Token-Faithful Rollout Framework for GRPO Training Across Codex, 
Claude Code, and Qwen Code","datePublished":"2026-05-27T17:14:04+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/"},"wordCount":1971,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-10.08.51-AM-1-v2ssjz.png","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"de","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/","url":"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/","name":"NVIDIA Releases Polar, a Token-Faithful Rollout Framework for GRPO Training Across Codex, Claude Code, and Qwen Code - 
YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"primaryImageOfPage":{"@id":"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/#primaryimage"},"image":{"@id":"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-10.08.51-AM-1-v2ssjz.png","datePublished":"2026-05-27T17:14:04+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/#breadcrumb"},"inLanguage":"de","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/"]}]},{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/#primaryimage","url":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-10.08.51-AM-1-v2ssjz.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-10.08.51-AM-1-v2ssjz.png","width":1334,"height":872},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/nvidia-releases-polar-a-token-faithful-rollout-framework-for-grpo-training-across-codex-claude-code-and-qwen-code\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"NVIDIA Releases Polar, a Token-Faithful Rollout Framework for GRPO Training Across Codex, Claude Code, and Qwen 
Code"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"de"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin 
NU"},"url":"https:\/\/youzum.net\/de\/members\/adminnu\/"}]}},"rttpg_featured_image_url":{"full":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-10.08.51-AM-1-v2ssjz.png",1334,872,false],"landscape":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-10.08.51-AM-1-v2ssjz.png",1334,872,false],"portraits":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-10.08.51-AM-1-v2ssjz.png",1334,872,false],"thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-10.08.51-AM-1-v2ssjz-150x150.png",150,150,true],"medium":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-10.08.51-AM-1-v2ssjz-300x196.png",300,196,true],"large":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-10.08.51-AM-1-v2ssjz-1024x669.png",1024,669,true],"1536x1536":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-10.08.51-AM-1-v2ssjz.png",1334,872,false],"2048x2048":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-10.08.51-AM-1-v2ssjz.png",1334,872,false],"trp-custom-language-flag":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-10.08.51-AM-1-v2ssjz-18x12.png",18,12,true],"woocommerce_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-10.08.51-AM-1-v2ssjz-300x300.png",300,300,true],"woocommerce_single":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-10.08.51-AM-1-v2ssjz-600x392.png",600,392,true],"woocommerce_gallery_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-10.08.51-AM-1-v2ssjz-100x100.png",100,100,true]},"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/de\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/de\/category\/ai-club\/\" rel=\"category 
tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/de\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/de\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/de\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"Reinforcement learning for language agents is growing more complex. Agents now manage multi-turn tool use, long-running contexts, and multi-agent orchestration. The main engineering challenge is connecting existing agent software to training pipelines without breaking how those tools work. NVIDIA\u2019s research team introduced Polar, a rollout framework that lets researchers run reinforcement learning over any agent&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/posts\/93275","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/comments?post=93275"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/posts\/93275\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/media\/93276"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/media?parent=93275"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/categories?post=93275"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/tags?post=93275"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}