Ornith-1.0: Self-Scaffolding LLMs Are Rewriting Agentic Coding
Discover Ornith-1.0, the open-source family of agentic coding models that's redefining AI software development. Learn how its self-scaffolding reinforcement learning framework, benchmark performance, and autonomous orchestration compare with Claude Opus and other leading coding models. Explore model sizes, deployment options, real-world use cases, and why orchestration intelligence, not just bigger LLMs, is becoming the next frontier in AI coding agents.
Paul Dhaliwal
Founder & Chief Executive Officer ยท Updated Jun 30, 2026ยท15 min read
Why do most coding agents still fall apart the moment a task strays outside the workflow they were built for?
The answer isn't the model; it's the harness wrapped around it. Every retry loop, every tool call sequence, every error recovery path was designed by a human engineer, not learned by the model itself. And when the task deviates, the harness breaks.
DeepReinforce released Ornith-1.0, an open-source family of agentic coding models. Its 397B flagship scored 82.4 on SWE-Bench Verified and 77.5 on Terminal-Bench 2.1, outperforming Claude Opus 4.7 without proprietary code or usage limits (Source).
More notably, Ornith-1.0 is the first open-source coding model trained to generate its own task orchestration instead of relying on human-designed scaffolds.
For AI delivery teams, that distinction matters more than the benchmark number. It signals a shift in where the real leverage in agentic coding lives, and it has infrastructure implications that most coverage has overlooked entirely.
What Is Ornith-1.0?
Ornith-1.0 is a family of four open-source large language models built specifically for agentic coding. Released by DeepReinforce AI, all variants ship under the MIT license with no regional restrictions.
The model family at a glance:
9B Dense: Post-trained on Qwen 3.5; edge and single-GPU deployments
31B Dense: Post-trained on Gemma 4; mid-scale server environments
35B MoE: Post-trained on Qwen 3.5; efficient inference via sparse activation
397B MoE: Post-trained on Qwen 3.5; enterprise-grade autonomous coding pipelines
The name comes from the ancient Greek word for bird, and like a bird building its own nest, Ornith constructs its own scaffolding before it solves a task.
This is not a general-purpose LLM with a coding fine-tune layered on top. Ornith-1.0 was purpose-built for the full agentic coding loop:
Multi-step reasoning across long task horizons
Repository-level code understanding and navigation
Terminal execution and shell command handling
Iterative code repair and debugging
Test-driven patching across multi-file codebases
Every variant is a reasoning model responses open with a <think> block, and the reasoning_content field returns separately through the API. All four models emit standard tool calls, making them compatible with any OpenAI-format agent loop without code changes.
What most coverage skips is the lab behind the release. DeepReinforce is not a first-time team dropping a model card. Their prior work includes:
CUDA-L1: Open-source CUDA kernel optimisation research
IterX: A reinforcement learning optimisation loop built specifically for code agents
Ornith-1.0 is the next step in a sustained RL research direction, not a one-off release. That context matters when evaluating how seriously to take the architecture claims behind it.
The Hidden Cost of Hand-Engineered Agent Harnesses
Before Ornith-1.0 can be properly evaluated, there's a foundational problem worth naming, one that affects every coding agent on the market today, open-source or proprietary.
Most coding agents don't just run on a model. They run on a harness, the orchestration wrapper that sits between the model and the task. A harness defines:
How memory is structured across a long task
When and in what order tools get called
How errors are caught and retried
How the agent replans when a step fails
How execution is sequenced across multi-file changes
Frameworks like OpenHands, Claude Code, and Cursor each ship with their own. Internal dev platforms build and maintain their own.
The problem is what happens after the harness ships.
Every new model integration requires harness updates. Every benchmark change exposes gaps in the orchestration logic. Every new task category, a different repo structure, a different testing framework, and a different failure pattern sends engineers back into the harness to patch it. This is invisible infrastructure debt. It doesn't show up in benchmark tables, but it compounds at scale across every team running agents in production.
There's also a runtime failure mode that rarely gets discussed. When an older coding agent produces a broken plan, it tends to repeat the same error on the next attempt rather than replan from a different angle. The harness has no mechanism to penalise that repetition, so the agent stalls, and a human steps in. The result is shorter autonomous run lengths and higher intervention rates than benchmark scores suggest.
How Ornith-1.0's Self-Improving Training Framework Works
Ornith-1.0's core innovation is a two-stage reinforcement learning loop that jointly optimises two things most models treat separately: the orchestration strategy and the solution itself.
The Two-Stage RL Loop
Each RL iteration runs in two sequential stages:
Stage 1: Scaffold Generation
Conditioned on the task and the scaffold used in the previous iteration, the model proposes a refined execution strategy. That scaffold specifies:
How many reasoning steps to perform
Which tools to call and in what sequence
How memory should be organised across the task
When to retry, and how to structure the retry
How to handle intermediate failures and replan
Stage 2: Solution Rollout
Using the scaffold it just generated, the model executes the actual coding task, reading repository files, editing code, running terminal commands, executing tests, and generating patches.
The reward from the rollout flows back to both stages simultaneously. Scaffolds that produced high-quality solutions survive into the next iteration. Weak orchestration patterns get replaced.
Three Layers of Reward Hacking Defence
Self-authored scaffolds create a specific attack surface on which the model could write orchestration logic that games the verifier by reading visible test files, hardcoding expected outputs, or locating an oracle solution sitting in the environment.
Ornith addresses this with three defence layers stacked in sequence:
Fixed trust boundary: The environment, tool surface, and test isolation are immutable. The model can only evolve its inner policy scaffold: memory layout, error-handling logic, orchestration sequencing. The verification layer is out of reach.
Deterministic monitor: Flags any attempt to read withheld file paths or modify verification scripts. Zero reward is assigned, and the trajectory is excluded from advantage computation entirely.
Frozen LLM judge: Sits on top of the verifier as a veto layer, catching intent-level gaming that stays within the technically allowed tool surface but violates the spirit of the task.
Each layer catches a different class of exploit. Together, they close the attack surface that self-scaffolding opens.
Asynchronous RL with Staleness-Weighted GRPO
Long agentic rollouts create a training problem that shorter tasks don't face. By the time a 30-step trajectory completes, the model weights have already updated, making the early tokens in that trajectory off-policy and potentially misleading as a training signal.
Get insights in your inbox!!
Weekly tips on building smarter apps. Join 8,200+ founders and builders.
No spam. Unsubscribe anytime. We respect your privacy.
Ornith handles this with pipeline-RL and a staleness-weighted loss:
Rollouts are processed asynchronously, so long trajectories don't block the training pipeline
Each token is weighted by how fresh it is relative to the current model state. Older off-policy tokens are downweighted
Tokens past a staleness threshold are dropped entirely rather than included at low weight
Token-level GRPO loss is clipped alongside the staleness weight, which prevents stale data from distorting gradient updates at 397B scale
The result is stable long-horizon training without sacrificing data efficiency a non-trivial engineering problem at the scale Ornith operates at.
Ornith-1.0 Benchmark Results: Numbers and Context
Ornith-1.0's benchmark performance is strong, but the numbers require context that most coverage doesn't provide. Here's what the scores show, what they don't, and what actually matters for teams making deployment decisions.
Performance Across Model Sizes
Model
Terminal-Bench 2.1
SWE-Bench Verified
Ornith-1.0-397B MoE
77.5
82.4
Ornith-1.0-35B MoE
64.2
โ
Ornith-1.0-31B Dense
โ
โ
Ornith-1.0-9B Dense
43.1
69.4
Claude Opus 4.8
85.0
87.6
Claude Opus 4.7
70.3
80.8
GLM-5.2-744B
81.0
โ
Qwen 3.5-397B
53.5
โ
DeepSeek-V4-Pro
โ
80.6
MiniMax M3
โ
โ
Key readings from the table:
The 397B flagship surpasses Claude Opus 4.7 on both benchmarks
The 35B MoE beats Qwen 3.5-397B (53.5) by over 10 points on Terminal-Bench at roughly one-eleventh the parameter count
The 9B Dense outperforms Gemma 4-31B, a model more than three times its size
Claude Opus 4.8 still leads the field at 85.0 and 87.6, respectively
GLM-5.2-744B leads Terminal-Bench 2.1 at 81.0 above Ornith's 77.5
The "state-of-the-art" designation in DeepReinforce's release applies specifically to open-source models of comparable parameter count. It is not a global leaderboard claim, and reading it as one misrepresents what the numbers actually show.
SWE-Bench Verified Limitations: Contamination, Leakage, and What Teams Should Actually Measure
Before SWE-Bench Verified scores are used to make any procurement or deployment decision, there are three findings from 2026 that every engineering team should know:
OpenAI stopped reporting SWE-Bench Verified in February 2026: An internal audit of 138 hard problems found that more than 59.4% had flawed test cases that rejected functionally correct solutions. The audit also found that frontier models could reproduce gold-patch solutions verbatim from just the task ID, a direct fingerprint of training-data contamination.
Solution leakage affects more benchmark instances: Independent research found that models recall correct file paths from training data up to 76% of the time using only the issue description, no code context, and no repository structure required. When recall drives resolution on a third of tasks, the score is measuring memory as much as reasoning.
A 2026 study introducing the SWE-ABS evaluation framework: This study found that 19.71% of patches previously marked as "resolved" by top coding agents were actually semantically incorrect when evaluated with strengthened test suites, highlighting weaknesses in benchmark verification.
The harder, less contaminated alternative is SWE-Bench Pro. Ornith-1.0-397B scores 62.2 there; Claude Opus 4.8 leads at 69.2. That gap evaluated on a benchmark with fewer leakage vectors is a more honest signal for teams comparing models in production contexts.
Cross-Harness Consistency: The Signal That Actually Generalises
One data point buried in DeepReinforce's release notes:
Ornith-1.0-397B scores 77.5 on Terminal-Bench 2.1 under Terminus-2 and 78.2 under Claude Code 2.1.126, two different agent runtimes, two different harnesses, near-identical results.
That consistency is significant. A model whose scores shift sharply across harnesses is demonstrating benchmark-specific optimisation, not genuine capability. Ornith's cross-harness stability is the practical evidence that self-scaffolding generalises the learned orchestration strategies, transferring across agent runtimes rather than collapsing when the evaluation environment changes.
For production teams, this is a more useful signal than any single headline number.
Choosing the Right Ornith-1.0 Model for Your Stack
Not every team needs a 397B cluster. The Ornith family was designed to cover a range of deployment environments, and the right choice depends on infrastructure constraints, latency requirements, and task complexity. Here's how to think through it.
Deployment Scenario
Model
VRAM Required
Why It Fits
Local/offline / laptop coding
9B Dense (Q4)
~6 GB
Runs on consumer hardware; zero API cost ceiling
Single-GPU dev server
35B MoE (Q5_K_M)
~25 GB
Higher benchmark scores at lower per-token compute than the 9B
Mid-scale internal pipeline
31B Dense (bf16)
~62 GB
Solid all-rounder for constrained server environments without MoE infrastructure
Production agent pipeline
397B MoE (FP8)
~200 GB
Flagship performance; best fit for high-throughput enterprise pipelines
The counter-intuitive pick: why the 35B MoE outperforms the 9B at inference speed
This is the detail most coverage skips entirely. The 35B MoE is not slower than the 9B Dense just because it has more parameters; it's actually faster at inference. The reason is sparse activation: the MoE architecture only activates approximately 3B parameters per token at runtime, regardless of the total parameter count.
The 9B Dense activates all 9B parameters on every token. In practice, the 35B MoE delivers higher benchmark scores at lower per-token compute cost than the 9B, making it the practical sweet spot for most teams running on a single high-end GPU.
Serving options across the stack:
vLLM / SGLang: Recommended for production; OpenAI-compatible endpoint out of the box
Transformers: Requires a reasoning parser enabled to surface the reasoning_content field correctly
Ollama / Unsloth: For GGUF variants; straightforward local deployment with no extra configuration
One deployment consideration worth flagging:
The 397B MoE in FP8 requires approximately 200GB of GPU memory, typically an 8ร 80GB GPU configuration. For teams without that infrastructure, the 35B MoE closes most of the performance gap at a fraction of the hardware cost. The jump from 35B to 397B is meaningful on Terminal-Bench (64.2 โ 77.5) but may not justify the infrastructure overhead depending on task category and throughput requirements.
CONCLUSION
Ornith-1.0 is not just a benchmark milestone. It's evidence that the next competitive axis in agentic coding is orchestration intelligence and that the infrastructure layer sitting above the model is becoming more important, not less, as models get smarter about managing themselves.
The teams positioned to benefit are those building delivery platforms that can route, monitor, and govern autonomous coding pipelines at scale, not just swap in the latest model when a new leaderboard drops.
That's exactly the problem CodeConductor is built to solve.
Ready to build AI delivery infrastructure that keeps pace with where agentic coding is headed? Get started with CodeConductor today!
Ready to Build Without Code?
See how CodeConductor helps enterprises ship faster while staying compliant.
Ornith-1.0 is an open-source agentic coding model family by DeepReinforce, released June 25, 2026. It spans four sizes, 9B to 397B, and is the first coding model trained to generate its own task scaffolds during reinforcement learning rather than relying on human-designed harnesses.
How does Ornith-1.0 compare to Claude Opus 4.7 and 4.8?
The 397B variant surpasses Claude Opus 4.7 on both Terminal-Bench 2.1 (77.5 vs. 70.3) and SWE-Bench Verified (82.4 vs. 80.8). Claude Opus 4.8 still leads the field at 85.0 and 87.6, respectively.
Is Ornith-1.0 truly open source?
Yes, MIT licensed, globally accessible, no regional restrictions. Weights are available on Hugging Face in FP8, GGUF, and bf16 formats across all four model sizes.
Which Ornith-1.0 model size is best for most teams?
The 35B MoE is the practical sweet spot; it's faster than the 9B Dense at inference due to sparse activation, requires only ~25GB VRAM, and significantly outperforms models several times its parameter size on benchmark tasks.
Does Ornith-1.0 work with existing agent frameworks?
Yes. All variants are confirmed compatible with OpenHands, Claude Code 2.1, OpenClaw, and Hermes Agent out of the box. Every model exposes an OpenAI-compatible endpoint, so existing agent loops require no code changes.
Key Takeaways
4 essential insights
Prioritize self-scaffolding models to reduce fragile, hand-engineered agent harnesses.
Evaluate agentic coding by orchestration robustness, not just benchmark scores.
Choose Ornith variants to match deployment scale and inference efficiency needs.
Plan infrastructure for tool-call compatibility and long-horizon, repo-level task loops.
Written by
Paul Dhaliwal
Founder & Chief Executive Officer
Paul Dhaliwal is a tech innovator and Founder of CodeConductor, an open-source no/low-code platform. With 10+ years of experience in AI and scalable development, Paul focuses on crafting intelligent solutions that drive real-world value. A firm believer in the mantra "Eat, Sleep, Code, Repeat," he balances his passion for software with a love for travel and family.
โก
Build your app
No coding. No designers. Just describe what you want and watch AI build it.