DeepReinforce has released Ornith-1.0, an open-source model family built for agentic coding. The lineup spans four sizes, from a 9B dense model to a 397B mixture-of-experts flagship. Every checkpoint ships under the MIT license on Hugging Face. The models are post-trained on top of pretrained Gemma 4 and Qwen 3.5. Most coding agents pair a model with a fixed, human-designed harness. Ornith-1.0 instead learns to write its own. The DeepReinforce research team reports state-of-the-art results among open models of comparable size. TL;DR Ornith-1.0 ships in 9B, 31B, 35B-MoE, and 397B-MoE sizes under MIT, built on Gemma 4 and Qwen 3.5. The model learns its own scaffold during RL, jointly optimizing the harness and the solution. Ornith-1.0-397B tops Claude Opus 4.7 on both headline benchmarks, but not Opus 4.8 or the larger GLM-5.2-744B. Three layers — fixed trust boundary, deterministic monitor, frozen LLM judge — guard against reward hacking. What is Ornith-1.0? Ornith-1.0 is a set of reasoning models tuned for coding agents. The variants are 9B Dense, 31B Dense, 35B MoE, and 397B MoE. The 35B model is mixture-of-experts and activates roughly 3B parameters per token. FP8 and GGUF builds are also published for faster local serving. Each model is a reasoning model. Replies open with a <think> block before the final answer. The serving recipes enable a reasoning parser, so that trace returns in a separate reasoning_content field. The models also emit well-formed tool calls for agent loops. Deployment is straightforward. The 9B model is about 19GB in bf16 and serves on a single 80GB GPU. Serving recipes target vLLM, SGLang, and Transformers. Each model exposes an OpenAI-compatible endpoint. Standard agent frameworks therefore work without code changes. Interactive Explainer </button> <button class=”btn gho” id=”resetBtn”>Reset</button> </div> <div class=”stepout” id=”stepOut”>Step 0 — untrained policy with a fixed, hand-written harness.</div> </div> <!– PANEL 2: BENCH –> <div class=”panel” data-panel=”bench”> <div class=”lead”>Vendor-reported scores from DeepReinforce. Pick a model tier and a benchmark. Ornith is highlighted in green. Higher is better.</div> <div class=”seg”><span class=”lab”>Model tier</span> <div class=”chip on” data-tier=”t397″>397B flagship</div> <div class=”chip” data-tier=”t35″>35B MoE</div> <div class=”chip” data-tier=”t9″>9B dense</div> </div> <div class=”seg” id=”benchChips”><span class=”lab”>Benchmark</span></div> <div class=”chart” id=”chart”></div> <div class=”foot-note” id=”benchNote”></div> </div> <!– PANEL 3: DEFENSES –> <div class=”panel” data-panel=”def”> <div class=”lead”>A model that writes its own scaffold could cheat the verifier. DeepReinforce describes three defense layers. Tap each to expand.</div> <div class=”layers”> <div class=”layer open”><div class=”lh”><span class=”num”>1</span><span class=”lt”>Fixed trust boundary</span><span class=”more”>tap</span></div><div class=”lb”>The environment, tool surface, and test isolation are immutable and outside the model’s reach. The model evolves only its inner policy scaffold — memory, error-handling, and orchestration logic.</div></div> <div class=”layer”><div class=”lh”><span class=”num”>2</span><span class=”lt”>Deterministic monitor</span><span class=”more”>tap</span></div><div class=”lb”>A rule-based monitor flags any attempt to read withheld paths, modify verification scripts, or invoke unsanctioned tools. Such trajectories get zero reward and are excluded from the advantage computation.</div></div> <div class=”layer”><div class=”lh”><span class=”num”>3</span><span class=”lt”>Frozen LLM judge</span><span class=”more”>tap</span></div><div class=”lb”>Because intent-level gaming can happen inside the allowed tool surface, a frozen LLM judge acts as a veto on top of the verifier — not as the primary reward signal.</div></div> </div> </div> <div class=”ftr”><span>Source: <a href=”https://deep-reinforce.com/ornith_1_0.html” target=”_blank” rel=”noopener”>deep-reinforce.com</a> · MIT licensed · numbers vendor-reported</span><span><b>Marktechpost</b> · AI Dev Signals</span></div> <script> (function(){ var root=document.getElementById(‘mtp-ornith-demo’); /* tabs */ root.querySelectorAll(‘.tab’).forEach(function(t){ t.addEventListener(‘click’,function(){ root.querySelectorAll(‘.tab’).forEach(function(x){x.classList.remove(‘on’)}); root.querySelectorAll(‘.panel’).forEach(function(x){x.classList.remove(‘on’)}); t.classList.add(‘on’); root.querySelector(‘.panel[data-panel=”‘+t.dataset.p+’”]’).classList.add(‘on’); resize(); }); }); /* loop sim */ var step=0,reward=0.08,timer=null; var scaffs=[ ‘Baseline harness: linear retries, no memory.’, ‘Adds scratchpad memory across tool calls.’, ‘Adds error-triage branch before re-edit.’, ‘Reorders: read tests, then plan, then patch.’, ‘Caches sub-results; prunes dead branches.’, ‘Task-specific orchestration emerges automatically.’]; var outs=[ ‘Fixed harness, no learning yet.’, ‘Fewer redundant file reads observed.’, ‘Recovers from failed edits more often.’, ‘Higher first-pass test success.’, ‘Shorter trajectories, same accuracy.’, ‘Stable high-reward scaffold selected.’]; var nodes=root.querySelectorAll(‘.node’); function lightSeq(cb){ var i=0;nodes.forEach(function(n){n.classList.remove(‘act’)}); var iv=setInterval(function(){ nodes.forEach(function(n){n.classList.remove(‘act’)}); nodes[i].classList.add(‘act’);i++; if(i>=nodes.length){clearInterval(iv);setTimeout(function(){nodes.forEach(function(n){n.classList.remove(‘act’)});cb&&cb();},260);} },220); } function doStep(){ if(step>=5){return;} step++; lightSeq(function(){ reward=[0.08,0.27,0.43,0.58,0.69,0.77][step]; root.querySelector(‘#rFill’).style.width=(reward*100)+’%’; root.querySelector(‘#rVal’).textContent=reward.toFixed(2); root.querySelector(‘#scaffTxt’).textContent=scaffs[step]; root.querySelector(‘#outTxt’).textContent=outs[step]; root.querySelector(‘#stepOut’).innerHTML=’Step ‘+step+’ — <b>scaffold mutated</b>; reward propagated to both stages.’; resize(); }); } root.querySelector(‘#stepBtn’).addEventListener(‘click’,doStep); root.querySelector(‘#autoBtn’).addEventListener(‘click’,function(){ if(timer){clearInterval(timer);timer=null;this.textContent=’Auto-run ‘;return;} this.textContent=’Pause ‘;var b=this; timer=setInterval(function(){if(step>=5){clearInterval(timer);timer=null;b.textContent=’Auto-run ‘;}else{doStep();}},1400); }); root.querySelector(‘#resetBtn’).addEventListener(‘click’,function(){ if(timer){clearInterval(timer);timer=null;root.querySelector(‘#autoBtn’).textContent=’Auto-run ‘;} step=0;reward=0.08; root.querySelector(‘#rFill’).style.width=’8%’; root.querySelector(‘#rVal’).textContent=’0.08′; root.querySelector(‘#scaffTxt’).textContent=scaffs[0]; root.querySelector(‘#outTxt’).textContent=’Press “Run training step” to begin.’; root.querySelector(‘#stepOut’).innerHTML=’Step 0 — untrained policy with a fixed, hand-written harness.’; resize(); }); /* benchmark data (vendor-reported) */ var BENCHES=[‘Terminal-Bench 2.1′,’SWE-Bench Verified’,’SWE-Bench Pro’,’SWE-Bench Multilingual’,’NL2Repo’,’ClawEval Avg’]; var DATA={ t397:{label:’Ornith-1.0-397B’,hero:’Ornith-1.0-397B’, models:[‘Ornith-1.0-397B’,’Qwen3.5-397B’,’Qwen3.7-Max’,’GLM-5.2-744B’,’Minimax-M3-428B’,’DeepSeek-V4-Pro-1.6T’,’Claude Opus 4.7′,’Claude Opus 4.8′], vals:[[77.5,53.5,73.5,81.0,64,64,70.3,85],[82.4,76.4,80.4,null,null,80.6,80.8,87.6],[62.2,51.6,60.6,62.1,59,55.4,64.3,69.2],[78.9,69.3,78.3,null,null,76.2,null,null],[48.2,36.8,47.2,48.9,42.1,null,null,69.7],[77.1,70.7,65.2,null,null,75.8,78.2,null]]}, t35:{label:’Ornith-1.0-35B-A3B’,hero:’Ornith-1.0-35B-A3B’, models:[‘Ornith-1.0-35B-A3B’,’Qwen3.5-35B-A3B’,’Qwen3.6-35B-A3B’,’Gemma4-31B’,’Qwen3.5-397B’], vals:[[64.2,41.4,52.5,42.1,53.5],[75.6,70,73.4,52,76.4],[50.4,44.6,49.5,35.7,51.6],[69.3,60.3,67.2,51.7,69.3],[34.6,20.5,29.4,15.5,36.8],[69.8,65.4,68.7,48.5,70.7]]}, t9:{label:’Ornith-1.0-9B’,hero:’Ornith-1.0-9B’, models:[‘Ornith-1.0-9B’,’Qwen3.5-9B’,’Qwen3.5-35B-A3B’,’Gemma4-12B’,’Gemma4-31B’], vals:[[43.1,21.3,41.4,21,42.1],[69.4,53.2,70,44.2,52],[42.9,31.3,44.6,27.6,35.7],[52,39.7,60.3,32.5,51.7],[27.2,16.2,20.5,10.3,15.5],[63.1,53.2,65.4,32.5,48.5]]} }; var curTier=’t397′,curB=0; var bchips=root.querySelector(‘#benchChips’); BENCHES.forEach(function(b,i){ var c=document.createElement(‘div’);c.className=’chip’+(i===0?’ on’:”);c.textContent=b;c.dataset.b=i; c.addEventListener(‘click’,function(){curB=i;bchips.querySelectorAll(‘.chip’).forEach(function(x){x.classList.remove(‘on’)});c.classList.add(‘on’);draw();}); bchips.appendChild(c); }); root.querySelectorAll(‘.chip[data-tier]’).forEach(function(c){ c.addEventListener(‘click’,function(){curTier=c.dataset.tier;root.querySelectorAll(‘.chip[data-tier]’).forEach(function(x){x.classList.remove(‘on’)});c.classList.add(‘on’);draw();}); }); function draw(){ var d=DATA[curTier];var row=d.vals[curB];var chart=root.querySelector(‘#chart’);chart.innerHTML=”; var max=Math.max.apply(null,row.filter(function(v){return v!=null})); d.models.forEach(function(m,i){ var v=row[i];var hero=(m===d.hero); var div=document.createElement(‘div’);div.className=’row’+(hero?’ hero’:”)+(v==null?’ na’:”); div.innerHTML='<div class=”nm”>’+m+'</div><div class=”bt”><div class=”bf”></div></div><div class=”vl”>’+(v==null?’n/a’:v)+'</div>’; chart.appendChild(div); (function(bf,val){setTimeout(function(){bf.style.width=(val==null?0:(val/max*100))+’%’;},40);})(div.querySelector(‘.bf’),v); }); root.querySelector(‘#benchNote’).textContent=’Benchmark: ‘+BENCHES[curB]+’. Bars scaled to the highest score shown. “n/a” = not reported by the vendor. Self-reported, not independently verified.’; resize(); } draw(); /* defenses accordion */ root.querySelectorAll(‘.layer’).forEach(function(l){ l.addEventListener(‘click’,function(){l.classList.toggle(‘open’);resize();}); }); /* auto-resize for WordPress iframe */ function resize(){ try{ var h=root.offsetHeight+40; if(window.parent){window.parent.postMessage({type:’mtp-ornith-height’,height:h},’*’);} }catch(e){} } window.addEventListener(‘load’,resize); setTimeout(resize,300); window.addEventListener(‘resize’,resize); })(); </script> </div> ” style=”width:100%;border:0;display:block;min-height:600px;overflow:hidden” height=”600″ scrolling=”no” loading=”lazy” title=”Ornith-1.0 Interactive Explainer”> The Self-Scaffolding Idea Most coding agents rely on a scaffold, also called a harness. A scaffold wraps the model with memory, tools, error handling, and orchestration logic. AI teams usually hand-design one scaffold per task category. Ornith-1.0 treats the scaffold as a learnable object instead. During reinforcement learning, the scaffold co-evolves with the model’s policy. Each RL step runs in two stages. First, the model reads the task and its previous scaffold. It then proposes a refined scaffold. Second, it uses that scaffold and the task to generate a solution rollout. Reward from the rollout flows back to both stages. So the model is optimized to author orchestration, not just answers. Over training, higher-reward scaffolds are mutated and selected automatically. Per-task strategies emerge without hand-engineered harness design. Training also runs asynchronously, using a pipeline-RL setup. A staleness weight downweights older, off-policy tokens and drops them past a threshold. The optimization uses a token-level GRPO objective. Guarding Against Reward Hacking Letting a model write its own scaffold invites reward hacking. A scaffold could read visible test files and hardcode expected outputs. It could also copy an oracle solution sitting in the environment. DeepReinforce team describes three defense layers. The outer trust