{"id":95768,"date":"2026-06-07T17:38:59","date_gmt":"2026-06-07T17:38:59","guid":{"rendered":"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/"},"modified":"2026-06-07T17:38:59","modified_gmt":"2026-06-07T17:38:59","slug":"meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b","status":"publish","type":"post","link":"https:\/\/youzum.net\/th\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/","title":{"rendered":"Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b"},"content":{"rendered":"<p class=\"wp-block-paragraph\">Most search agents are trained as policies over a growing transcript. The model decides how to search. It must also remember what it saw, which evidence matters, and which claims it checked. A team of researchers from University of Illinois Urbana-Champaign, UC Berkeley, and Chroma argues this asks too much. Reinforcement learning ends up optimizing both search decisions and routine bookkeeping at once.<\/p>\n<p class=\"wp-block-paragraph\">Their answer is <strong>Harness-1<\/strong>, a 20B retrieval subagent built on gpt-oss-20b. It was trained with reinforcement learning inside a stateful search harness. The harness holds the bookkeeping. The policy keeps the semantic decisions. The weights and harness code are publicly released. 
<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1198\" height=\"612\" data-attachment-id=\"80362\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/06\/06\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/screenshot-2026-06-06-at-11-13-42-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.13.42-PM-1.png\" data-orig-size=\"1198,612\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\",\"alt\":\"\"}' data-image-title=\"Screenshot 2026-06-06 at 11.13.42\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.13.42-PM-1-1024x523.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.13.42-PM-1.png\" alt=\"\" class=\"wp-image-80362\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2606.02373<\/figcaption><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>What is Harness-1 Actually<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Harness-1 produces a ranked set of documents for a downstream answering model. It does not answer questions itself. It runs inside a state-machine harness centered on a per-episode WORKINGMEMORY.<\/p>\n<p class=\"wp-block-paragraph\">Each turn works as a loop. The harness renders compact search state along with recent actions. The model emits one structured action. 
The harness executes it, updates state, and renders the next observation.<\/p>\n<h2 class=\"wp-block-heading\"><strong>The Stateful Harness: What Moves Out of the Policy<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">The research team calls its principle stateful cognitive offloading. The policy decides what to search, curate, and verify, and when to stop. The harness maintains the recoverable state around those decisions.<\/p>\n<p class=\"wp-block-paragraph\">That state includes several pieces. A candidate pool holds compressed, deduplicated documents. An importance-tagged curated set is the final output, capped at 30 documents. Tags take four values: very_high, high, fair, or low. A full-text store keeps every retrieved chunk outside the prompt.<\/p>\n<p class=\"wp-block-paragraph\">An evidence graph adds structure. A regex extractor scans each chunk for proper nouns, years, and dates. The harness then renders frequent entities, bridge documents, and singletons. Bridge documents contain two or more frequent entities. Singletons appear in one document and suggest follow-up leads.<\/p>\n<p class=\"wp-block-paragraph\">The policy works through eight tools. These are fan_out_search, search_corpus, grep_corpus, read_document, review_docs, curate, verify, and end_search. Search outputs are compressed with sentence-BM25, keeping the top four sentences. Two-level deduplication removes repeats by chunk ID and content fingerprint.<\/p>\n<p class=\"wp-block-paragraph\">One design choice addresses cold starts. The first successful search auto-seeds the curated set with eight reranked results at fair importance. The policy then promotes strong documents and removes weak ones. This turns the task from building from scratch into refinement.<\/p>\n<p class=\"wp-block-paragraph\">The research team names three requirements for a trainable harness. These are warm-started curation, compact derived-state rendering, and diversity-preserving incentives. 
Harness-1 implements all three.<\/p>\n<h2 class=\"wp-block-heading\"><strong>How It Is Trained<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Training splits along the same line as the harness. Supervised fine-tuning teaches the model to operate the interface. Reinforcement learning improves search decisions over the maintained state.<\/p>\n<p class=\"wp-block-paragraph\">A single teacher, GPT-5.4, runs live inside the full harness. After filtering, 899 trajectories remain for SFT. SFT uses LoRA at rank 32 for three epochs. The step-550 checkpoint initializes RL.<\/p>\n<p class=\"wp-block-paragraph\">RL uses on-policy CISPO with a 40-turn cap and terminal-only reward. It trains only on SEC queries. Groups with identical rewards are dropped from the gradient. Training ran on Tinker.<\/p>\n<p class=\"wp-block-paragraph\">The reward separates discovery from selection. It also adds a tool-diversity bonus. Without that bonus, the agent collapses to repeated search and curated recall plateaus near 0.53. With the bonus, tool diversity stabilizes and curated recall reaches about 0.60.<\/p>\n<h2 class=\"wp-block-heading\"><strong>The Benchmark Case<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Harness-1 was evaluated on eight benchmarks spanning web, finance, patents, and multi-hop QA. The main metric is curated recall: the share of relevant documents that appear in the final curated set. 
Trajectory recall counts evidence encountered anywhere in the episode.<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Model<\/th>\n<th>Type<\/th>\n<th>Avg Curated Recall<\/th>\n<th>Avg Trajectory Recall<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Harness-1 (20B)<\/td>\n<td>Open small<\/td>\n<td>0.730<\/td>\n<td>0.807<\/td>\n<\/tr>\n<tr>\n<td>Tongyi DeepResearch 30B<\/td>\n<td>Open small<\/td>\n<td>0.616<\/td>\n<td>0.673<\/td>\n<\/tr>\n<tr>\n<td>Context-1 (20B)<\/td>\n<td>Open small<\/td>\n<td>0.603<\/td>\n<td>0.756<\/td>\n<\/tr>\n<tr>\n<td>Search-R1 (32B)<\/td>\n<td>Open small<\/td>\n<td>0.289<\/td>\n<td>0.289<\/td>\n<\/tr>\n<tr>\n<td>GPT-OSS-20B<\/td>\n<td>Open small<\/td>\n<td>0.262<\/td>\n<td>0.590<\/td>\n<\/tr>\n<tr>\n<td>Qwen3 (32B)<\/td>\n<td>Open small<\/td>\n<td>0.216<\/td>\n<td>0.446<\/td>\n<\/tr>\n<tr>\n<td>Opus-4.6<\/td>\n<td>Frontier<\/td>\n<td>0.764<\/td>\n<td>0.794<\/td>\n<\/tr>\n<tr>\n<td>GPT-5.4<\/td>\n<td>Frontier<\/td>\n<td>0.709<\/td>\n<td>0.752<\/td>\n<\/tr>\n<tr>\n<td>Sonnet-4.6<\/td>\n<td>Frontier<\/td>\n<td>0.688<\/td>\n<td>0.725<\/td>\n<\/tr>\n<tr>\n<td>Kimi-K2.5<\/td>\n<td>Frontier<\/td>\n<td>0.647<\/td>\n<td>0.794<\/td>\n<\/tr>\n<tr>\n<td>GPT-OSS-120B<\/td>\n<td>Frontier<\/td>\n<td>0.496<\/td>\n<td>0.769<\/td>\n<\/tr>\n<\/tbody>\n<\/table><figcaption class=\"wp-element-caption\"><em>Averages across eight benchmarks, from Figure 1 of the paper. Frontier models run as zero-shot retrievers under the Context-1 harness.<\/em><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Harness-1 reaches 0.730 average curated recall. That beats the next open subagent, Tongyi DeepResearch 30B, by 11.4 points (0.730 vs. 0.616). Among the frontier searchers tested, only Opus-4.6 scores higher on average.<\/p>\n<p class=\"wp-block-paragraph\">The transfer pattern is the clearest signal of the mechanism. SFT used four benchmark families; RL used only SEC. 
On those source-family tasks, Harness-1 gained 7.9 points over the closest open baseline. On four held-out benchmarks, it gained 17.0 points. That is a 2.2x larger gain on tasks furthest from training data.<\/p>\n<p class=\"wp-block-paragraph\">Ablations support the harness claim. Disabling all harness mechanisms drops Recall by 12.2 percent relative on BrowseComp+. The trained policy keeps searching but cannot rank what it sees.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img decoding=\"async\" width=\"1202\" height=\"1042\" data-attachment-id=\"80364\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/06\/06\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/screenshot-2026-06-06-at-11-15-46-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.15.46-PM-1.png\" data-orig-size=\"1202,1042\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\",\"alt\":\"\"}' data-image-title=\"Screenshot 2026-06-06 at 11.15.46\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.15.46-PM-1-1024x888.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.15.46-PM-1.png\" alt=\"\" class=\"wp-image-80364\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2606.02373<\/figcaption><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Use Cases<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">The method targets evidence-seeking retrieval where documents support an answer. 
Several workflows fit this shape.<\/p>\n<p class=\"wp-block-paragraph\">One is literature and patent review. The evidence graph and curated set help organize many sources. Another is financial-filing analysis. The SEC case study recovers an exact executive-transition date across multiple 8-Ks.<\/p>\n<p class=\"wp-block-paragraph\">A third is multi-hop fact-checking. The fan_out_search and verify tools resolve ambiguous entities before committing. A fourth is modular RAG. The curated set feeds a frozen generator, and better sets yield higher answer accuracy.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Strengths and Weaknesses<\/strong><\/h2>\n<h4 class=\"wp-block-heading\"><strong>Strengths<\/strong><\/h4>\n<ul class=\"wp-block-list\">\n<li>Highest average curated recall among the open models tested, and behind only Opus-4.6 overall.<\/li>\n<li>Gains hold on held-out benchmarks, suggesting domain-general search operations.<\/li>\n<li>Trained on 4,352 unique items, far fewer than several baselines.<\/li>\n<li>Open checkpoint and harness code, servable with common runtimes.<\/li>\n<\/ul>\n<h4 class=\"wp-block-heading\"><strong>Weaknesses<\/strong><\/h4>\n<ul class=\"wp-block-list\">\n<li>The evidence graph uses regex extraction, not full entity linking.<\/li>\n<li>The verify tool is an LLM proxy that can err on ambiguous claims.<\/li>\n<li>Sentence-BM25 compression may drop context tied to discourse structure.<\/li>\n<li>The research team reports point estimates without full confidence intervals.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>Harness-1 is a 20B search agent that moves search bookkeeping into the environment, leaving semantic decisions to the policy.<\/li>\n<li>It hits 0.730 average curated recall across eight benchmarks, beating the next open subagent by 11.4 points.<\/li>\n<li>Among the searchers tested, only Opus-4.6 scores higher on average curated recall.<\/li>\n<li>Gains are 
largest on held-out benchmarks (+17.0 vs +7.9 points), suggesting the learned search operations transfer.<\/li>\n<li>Weights and harness code are public, servable via vLLM, SGLang, or Transformers.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\"><strong>Marktechpost\u2019s Visual Explainer<\/strong><\/h2>\n<div>\n<div class=\"mtp-h1-head\">\n  <span class=\"mtp-h1-kicker\">Stateful Search Agents<\/span><br \/>\n  <span class=\"mtp-h1-count\"><span class=\"mtp-h1-cur\">1<\/span> \/ 7<\/span>\n<\/div>\n<div class=\"mtp-h1-viewport\">\n<div class=\"mtp-h1-track\">\n<article class=\"mtp-h1-slide\">\n<div class=\"mtp-h1-eyebrow\">Research Guide<\/div>\n<h2>Harness-1: a 20B search agent with a stateful harness<\/h2>\n<p class=\"mtp-h1-lead\">A retrieval subagent trained with reinforcement learning inside a search harness that holds the bookkeeping.<\/p>\n<div class=\"mtp-h1-chips\">\n        <span class=\"mtp-h1-chip\">20B \u00b7 gpt-oss-20b base<\/span><br \/>\n        <span class=\"mtp-h1-chip\">UIUC \u00b7 UC Berkeley \u00b7 Chroma<\/span><br \/>\n        <span class=\"mtp-h1-chip\">arXiv:2606.02373<\/span><br \/>\n        <span class=\"mtp-h1-chip\">Open weights &amp; code<\/span>\n      <\/div>\n<\/article>\n<article class=\"mtp-h1-slide\">\n<div class=\"mtp-h1-eyebrow\">The Core Idea<\/div>\n<h3>Split the work between policy and harness<\/h3>\n<p>Most search agents pack search decisions and routine bookkeeping into one growing transcript. Harness-1 separates the two. 
The paper calls this stateful cognitive offloading.<\/p>\n<div class=\"mtp-h1-two\">\n<div class=\"mtp-h1-box\">\n<div class=\"mtp-h1-lab\">Policy decides<\/div>\n<ul>\n<li>What to search<\/li>\n<li>Which documents to keep<\/li>\n<li>What claims to verify<\/li>\n<li>When to stop<\/li>\n<\/ul><\/div>\n<div class=\"mtp-h1-box\">\n<div class=\"mtp-h1-lab\">Harness maintains<\/div>\n<ul>\n<li>Candidate pool<\/li>\n<li>Curated evidence<\/li>\n<li>Verification records<\/li>\n<li>Context budget<\/li>\n<\/ul><\/div>\n<\/div>\n<\/article>\n<article class=\"mtp-h1-slide\">\n<div class=\"mtp-h1-eyebrow\">Inside the Harness<\/div>\n<h3>Environment-side working memory<\/h3>\n<ul class=\"mtp-h1-list\">\n<li><b>Candidate pool<\/b> <span>\u2014 compressed, deduplicated documents<\/span><\/li>\n<li><b>Curated set<\/b> <span>\u2014 importance-tagged, capped at 30 (very_high \/ high \/ fair \/ low)<\/span><\/li>\n<li><b>Evidence graph<\/b> <span>\u2014 entities, bridges, and singletons via regex extraction<\/span><\/li>\n<li><b>Verification cache<\/b> <span>\u2014 claim to document to yes\/no verdict<\/span><\/li>\n<li><b>Full-text store<\/b> <span>\u2014 every retrieved chunk kept outside the prompt<\/span><\/li>\n<li><b>Compression<\/b> <span>\u2014 sentence-BM25 keeps the top four sentences<\/span><\/li>\n<\/ul>\n<\/article>\n<article class=\"mtp-h1-slide\">\n<div class=\"mtp-h1-eyebrow\">Policy Actions<\/div>\n<h3>Eight tools edit the state<\/h3>\n<div class=\"mtp-h1-tools\">\n<div class=\"mtp-h1-tool\">fan_out_search<\/div>\n<div class=\"mtp-h1-tool\">search_corpus<\/div>\n<div class=\"mtp-h1-tool\">grep_corpus<\/div>\n<div class=\"mtp-h1-tool\">read_document<\/div>\n<div class=\"mtp-h1-tool\">review_docs<\/div>\n<div class=\"mtp-h1-tool\">curate<\/div>\n<div class=\"mtp-h1-tool\">verify<\/div>\n<div class=\"mtp-h1-tool\">end_search<\/div>\n<\/div>\n<div class=\"mtp-h1-note\">The first successful search auto-seeds the curated set with eight reranked documents at fair importance. 
The policy then promotes strong documents and removes weak ones.<\/div>\n<\/article>\n<article class=\"mtp-h1-slide\">\n<div class=\"mtp-h1-eyebrow\">Training<\/div>\n<h3>SFT to operate the interface, RL to search<\/h3>\n<div class=\"mtp-h1-kv\">\n<div><b>SFT:<\/b> GPT-5.4 teacher inside the harness \u00b7 899 trajectories \u00b7 LoRA rank 32 \u00b7 step-550 checkpoint<\/div>\n<div><b>RL:<\/b> on-policy CISPO \u00b7 SEC queries only \u00b7 40-turn cap \u00b7 terminal reward \u00b7 trained on Tinker<\/div>\n<div><b>Data scale:<\/b> 4,352 unique training items (899 SFT + 3,453 RL)<\/div>\n<\/div>\n<div class=\"mtp-h1-note\">Three trainability requirements: warm-started curation, compact derived-state rendering, and diversity-preserving incentives.<\/div>\n<\/article>\n<article class=\"mtp-h1-slide\">\n<div class=\"mtp-h1-eyebrow\">Results<\/div>\n<h3>What the numbers show<\/h3>\n<div class=\"mtp-h1-stat\">\n        <span class=\"mtp-h1-bignum\">0.730<\/span><br \/>\n        <span class=\"mtp-h1-statlab\">average curated recall<br \/>across eight benchmarks<\/span>\n      <\/div>\n<div class=\"mtp-h1-kv\">\n<div><b>+11.4 pts<\/b> over the next open subagent, Tongyi DeepResearch 30B<\/div>\n<div>Among the searchers tested, only <b>Opus-4.6<\/b> scores higher on average<\/div>\n<div>Transfer: <b>+17.0<\/b> on held-out vs <b>+7.9<\/b> on source-family (2.2x gap)<\/div>\n<div>Ablation: removing all harness mechanisms drops Recall <b>12.2%<\/b> relative<\/div>\n<\/div>\n<\/article>\n<article class=\"mtp-h1-slide\">\n<div class=\"mtp-h1-eyebrow\">Get Started<\/div>\n<h3>Run it yourself<\/h3>\n<div class=\"mtp-h1-kv\">\n<div><b>Serve:<\/b> vLLM, SGLang, or Transformers<\/div>\n<div><b>Checkpoint:<\/b> pat-jj\/harness-1 (Hugging Face, 21B params, BF16)<\/div>\n<div><b>Code:<\/b> github.com\/pat-jj\/harness-1<\/div>\n<div><b>Paper:<\/b> arXiv:2606.02373<\/div>\n<\/div>\n<div class=\"mtp-h1-note\">Harness-1 returns a curated set of documents for a downstream answering model. 
It does not answer questions itself.<\/div>\n<\/article><\/div>\n<\/div>\n<div class=\"mtp-h1-nav\">\n  <button class=\"mtp-h1-btn mtp-h1-prev\" aria-label=\"Previous slide\">\u2190 Prev<\/button>\n<div class=\"mtp-h1-dots\"><\/div>\n<p>  <button class=\"mtp-h1-btn mtp-h1-next\" aria-label=\"Next slide\">Next \u2192<\/button>\n<\/p><\/div>\n<div class=\"mtp-h1-foot\">\n  <span>Curated by <b>Marktechpost<\/b> \u2014 practitioner-first AI\/ML research, news, and dev tooling for engineers.<\/span>\n<\/div>\n<\/div>\n<p class=\"wp-block-paragraph\">\n<\/p><p class=\"wp-block-paragraph\">\n<\/p><p class=\"wp-block-paragraph\">\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<\/p><p class=\"wp-block-paragraph\">Check out\u00a0the\u00a0<strong><a href=\"https:\/\/arxiv.org\/pdf\/2606.02373\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a>, <\/strong><a href=\"https:\/\/huggingface.co\/pat-jj\/harness-1\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Model weights<\/strong> <\/a>and<strong> <a href=\"https:\/\/github.com\/pat-jj\/harness-1\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub Repo<\/a>.<\/strong><\/p>
\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/06\/06\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/\">Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Most search agents are trained as policies over a growing transcript. The model decides how to search. It must also remember what it saw, which evidence matters, and which claims it checked. A team of researchers from University of Illinois Urbana-Champaign, UC Berkeley, and Chroma argues this asks too much. Reinforcement learning ends up optimizing both search decisions and routine bookkeeping at once. Their answer is Harness-1, a 20B retrieval subagent built on gpt-oss-20b. It was trained with reinforcement learning inside a stateful search harness. The harness holds the bookkeeping. The policy keeps the semantic decisions. The weights and harness code are publicly released. https:\/\/arxiv.org\/pdf\/2606.02373 What is Harness-1 Actually Harness-1 produces a ranked set of documents for a downstream answering model. It does not answer questions itself. It runs inside a state-machine harness centered on a per-episode WORKINGMEMORY. Each turn works as a loop. 
The harness renders compact search state along with recent actions. The model emits one structured action. The harness executes it, updates state, and renders the next observation. The Stateful Harness: What Moves Out of the Policy The research team calls its principle stateful cognitive offloading. The policy decides what to search, curate, and verify, and when to stop. The harness maintains the recoverable state around those decisions. That state includes several pieces. A candidate pool holds compressed, deduplicated documents. An importance-tagged curated set is the final output, capped at 30 documents. Tags take four values: very_high, high, fair, or low. A full-text store keeps every retrieved chunk outside the prompt. An evidence graph adds structure. A regex extractor scans each chunk for proper nouns, years, and dates. The harness then renders frequent entities, bridge documents, and singletons. Bridge documents contain two or more frequent entities. Singletons appear in one document and suggest follow-up leads. The policy works through eight tools. These are fan_out_search, search_corpus, grep_corpus, read_document, review_docs, curate, verify, and end_search. Search outputs are compressed with sentence-BM25, keeping the top four sentences. Two-level deduplication removes repeats by chunk ID and content fingerprint. One design choice addresses cold starts. The first successful search auto-seeds the curated set with eight reranked results at fair importance. The policy then promotes strong documents and removes weak ones. This turns the task from building from scratch into refinement. The research team names three requirements for a trainable harness. These are warm-started curation, compact derived-state rendering, and diversity-preserving incentives. Harness-1 implements all three. How It is Trained Training splits along the same line as the harness. Supervised fine-tuning teaches the model to operate the interface. 
Reinforcement learning improves search decisions over the maintained state. A single teacher, GPT-5.4, runs live inside the full harness. After filtering, 899 trajectories remain for SFT. The model uses LoRA at rank 32 for three epochs. The step-550 checkpoint initializes RL. RL uses on-policy CISPO with a 40-turn cap and terminal-only reward. It trains only on SEC queries. Groups with identical rewards are dropped from the gradient. Training ran on Tinker. The reward separates discovery from selection. It also adds a tool-diversity bonus. Without that bonus, the agent collapses to repeated search. Curated recall then plateaus near 0.53. With the bonus, diversity stabilizes and recall reaches about 0.60. The Benchmark Case Harness-1 was evaluated on eight benchmarks spanning web, finance, patents, and multi-hop QA. The main metric is curated recall: coverage of relevant documents in the final set. Trajectory recall counts evidence encountered anywhere in the episode. Model Type Avg Curated Recall Avg Trajectory Recall Harness-1 (20B) Open small 0.730 0.807 Tongyi DeepResearch 30B Open small 0.616 0.673 Context-1 (20B) Open small 0.603 0.756 Search-R1 (32B) Open small 0.289 0.289 GPT-OSS-20B Open small 0.262 0.590 Qwen3 (32B) Open small 0.216 0.446 Opus-4.6 Frontier 0.764 0.794 GPT-5.4 Frontier 0.709 0.752 Sonnet-4.6 Frontier 0.688 0.725 Kimi-K2.5 Frontier 0.647 0.794 GPT-OSS-120B Frontier 0.496 0.769 Averages across eight benchmarks, from Figure 1 of the paper. Frontier models run as zero-shot retrievers under the Context-1 harness. Harness-1 reaches 0.730 average curated recall. That beats the next open subagent, Tongyi DeepResearch 30B, by 11.4 points. Among the frontier searchers tested, only Opus-4.6 scores higher on average. The transfer pattern is the clearest signal of the mechanism. SFT used four benchmark families; RL used only SEC. On those source-family tasks, Harness-1 gained 7.9 points over the closest open baseline. 
On four held-out benchmarks, it gained 17.0 points. That is a 2.2x larger gain on tasks furthest from training data. Ablations support the harness claim. Disabling all harness mechanisms drops Recall by 12.2 percent relative on BrowseComp+. The trained policy keeps searching but cannot rank what it sees. https:\/\/arxiv.org\/pdf\/2606.02373 Use Cases The method targets evidence-seeking retrieval where documents support an answer. Several workflows fit this shape. One is literature and patent review. The evidence graph and curated set help organize many sources. Another is financial-filing analysis. The SEC case study recovers an exact executive-transition date across multiple 8-Ks. A third is multi-hop fact-checking. The fan_out_search and verify tools resolve ambiguous entities before committing. A fourth is modular RAG. The curated set feeds a frozen generator, and better sets yield higher answer accuracy. Strengths and Weaknesses Strengths Highest average curated recall among the open models tested, and behind only Opus-4.6 overall. Gains hold on held-out benchmarks, suggesting domain-general search operations. Trained on 4,352 unique items, far fewer than several baselines. Open checkpoint and harness code, servable with common runtimes. Weaknesses The evidence graph uses regex extraction, not full entity linking. The verify tool is an LLM proxy that can err on ambiguous claims. Sentence-BM25 compression may drop context tied to discourse structure. The research team reports point estimates without full confidence intervals. Key Takeaways Harness-1 is a 20B search agent that moves search bookkeeping into the environment, leaving semantic decisions to the policy. It hits 0.730 average curated recall across eight benchmarks, beating the next open subagent by 11.4 points. Among the searchers tested, only Opus-4.6 scores higher on average curated recall. 
Gains are largest on held-out benchmarks (+17.0 vs +7.9 points), suggesting the learned<\/p>","protected":false},"author":2,"featured_media":95769,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-95768","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, 
follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/th\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/\" \/>\n<meta property=\"og:locale\" content=\"th_TH\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/th\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-06-07T17:38:59+00:00\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"7 \u0e19\u0e32\u0e17\u0e35\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b\",\"datePublished\":\"2026-06-07T17:38:59+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/\"},\"wordCount\":1468,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.13.42-PM-1-G9U0Hu.png\",\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"th\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-
on-gpt-oss-20b\/\",\"url\":\"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/\",\"name\":\"Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b - YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.13.42-PM-1-G9U0Hu.png\",\"datePublished\":\"2026-06-07T17:38:59+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/#breadcrumb\"},\"inLanguage\":\"th\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"th\",\"@id\":\"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/#primaryimage\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.13.42-PM-1-G9U0Hu.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.13.42-PM-1-G9U0Hu.png\",\"width\":1198,\"height\":612
},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"th\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"th\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin 
NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"th\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/th\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/th\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/","og_locale":"th_TH","og_type":"article","og_title":"Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/th\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2026-06-07T17:38:59+00:00","author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"Written by":"admin NU","Est. 
reading time":"7 \u0e19\u0e32\u0e17\u0e35"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b","datePublished":"2026-06-07T17:38:59+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/"},"wordCount":1468,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.13.42-PM-1-G9U0Hu.png","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"th","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/","url":"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/","name":"Meet Harness-1: A 20B 
Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b - YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"primaryImageOfPage":{"@id":"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/#primaryimage"},"image":{"@id":"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.13.42-PM-1-G9U0Hu.png","datePublished":"2026-06-07T17:38:59+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/#breadcrumb"},"inLanguage":"th","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/"]}]},{"@type":"ImageObject","inLanguage":"th","@id":"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/#primaryimage","url":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.13.42-PM-1-G9U0Hu.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.13.42-PM-1-G9U0Hu.png","width":1198,"height":612},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/meet-harness-1-a-20b-retrieval-subagent-trained-with-reinforcement-learning-inside-a-stateful-search-harness-on-gpt-oss-20b\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.n
et\/"},{"@type":"ListItem","position":2,"name":"Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"th"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"th","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"th","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin 
NU"},"url":"https:\/\/youzum.net\/th\/members\/adminnu\/"}]}},"rttpg_featured_image_url":{"full":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.13.42-PM-1-G9U0Hu.png",1198,612,false],"landscape":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.13.42-PM-1-G9U0Hu.png",1198,612,false],"portraits":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.13.42-PM-1-G9U0Hu.png",1198,612,false],"thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.13.42-PM-1-G9U0Hu-150x150.png",150,150,true],"medium":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.13.42-PM-1-G9U0Hu-300x153.png",300,153,true],"large":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.13.42-PM-1-G9U0Hu-1024x523.png",1024,523,true],"1536x1536":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.13.42-PM-1-G9U0Hu.png",1198,612,false],"2048x2048":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.13.42-PM-1-G9U0Hu.png",1198,612,false],"trp-custom-language-flag":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.13.42-PM-1-G9U0Hu-18x9.png",18,9,true],"woocommerce_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.13.42-PM-1-G9U0Hu-300x300.png",300,300,true],"woocommerce_single":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.13.42-PM-1-G9U0Hu-600x307.png",600,307,true],"woocommerce_gallery_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-06-at-11.13.42-PM-1-G9U0Hu-100x100.png",100,100,true]},"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/th\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/th\/category\/ai-club\/\" rel=\"category 
tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/th\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/th\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/th\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"Most search agents are trained as policies over a growing transcript. The model decides how to search. It must also remember what it saw, which evidence matters, and which claims it checked. A team of researchers from University of Illinois Urbana-Champaign, UC Berkeley, and Chroma argues this asks too much. Reinforcement learning ends up optimizing&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/posts\/95768","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/comments?post=95768"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/posts\/95768\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/media\/95769"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/media?parent=95768"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/categories?post=95768"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/tags?post=95768"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}