CodeConductor
Product
Solutions
Resources
Company
CodeConductor

Product-focused AI platform for scalable apps, agents and everything in between.

Platform

  • App Studio
  • Copilot Studio
  • Governance
  • Architecture
  • Integrations
  • Pricing

Solutions

  • For CIOs
  • For Engineering
  • For Business Units
  • Automate SDLC
  • Base44 Migration
  • Lovable Migration

Resources

  • Documentation
  • Customer Stories
  • Trust Center
  • Blog

Company

  • About Us
  • Team
  • Careers
  • Partners
  • Contact

ยฉ 2026 CodeConductor Inc. All rights reserved.

Privacy PolicyTerms & ConditionsCookie PolicyDo Not Sell or Share My Personal Information
  1. Blog
  2. AI Coding
  3. Ornith-1.0: Self-Scaffolding LLMs Are Rewriting Agentic Coding
AI Coding

Ornith-1.0: Self-Scaffolding LLMs Are Rewriting Agentic Coding

Discover Ornith-1.0, the open-source family of agentic coding models that's redefining AI software development. Learn how its self-scaffolding reinforcement learning framework, benchmark performance, and autonomous orchestration compare with Claude Opus and other leading coding models. Explore model sizes, deployment options, real-world use cases, and why orchestration intelligence, not just bigger LLMs, is becoming the next frontier in AI coding agents.

Paul Dhaliwal
Paul Dhaliwal
Founder & Chief Executive Officer ยท Updated Jun 30, 2026ยท15 min read
Ornith-1.0: Self-Scaffolding LLMs Are Rewriting Agentic Coding

Why do most coding agents still fall apart the moment a task strays outside the workflow they were built for? 

The answer isn't the model; it's the harness wrapped around it. Every retry loop, every tool call sequence, every error recovery path was designed by a human engineer, not learned by the model itself. And when the task deviates, the harness breaks.

DeepReinforce released Ornith-1.0, an open-source family of agentic coding models. Its 397B flagship scored 82.4 on SWE-Bench Verified and 77.5 on Terminal-Bench 2.1, outperforming Claude Opus 4.7 without proprietary code or usage limits (Source).

More notably, Ornith-1.0 is the first open-source coding model trained to generate its own task orchestration instead of relying on human-designed scaffolds.

For AI delivery teams, that distinction matters more than the benchmark number. It signals a shift in where the real leverage in agentic coding lives, and it has infrastructure implications that most coverage has overlooked entirely.

What Is Ornith-1.0?

Ornith-1.0 is a family of four open-source large language models built specifically for agentic coding. Released by DeepReinforce AI, all variants ship under the MIT license with no regional restrictions.

The model family at a glance:

  • 9B Dense: Post-trained on Qwen 3.5; edge and single-GPU deployments

  • 31B Dense: Post-trained on Gemma 4; mid-scale server environments

  • 35B MoE: Post-trained on Qwen 3.5; efficient inference via sparse activation

  • 397B MoE: Post-trained on Qwen 3.5; enterprise-grade autonomous coding pipelines

The name comes from the ancient Greek word for bird, and like a bird building its own nest, Ornith constructs its own scaffolding before it solves a task.

This is not a general-purpose LLM with a coding fine-tune layered on top. Ornith-1.0 was purpose-built for the full agentic coding loop:

  • Multi-step reasoning across long task horizons

  • Repository-level code understanding and navigation

  • Terminal execution and shell command handling

  • Iterative code repair and debugging

  • Test-driven patching across multi-file codebases

Every variant is a reasoning model responses open with a <think> block, and the reasoning_content field returns separately through the API. All four models emit standard tool calls, making them compatible with any OpenAI-format agent loop without code changes.

What most coverage skips is the lab behind the release. DeepReinforce is not a first-time team dropping a model card. Their prior work includes:

  • CUDA-L1: Open-source CUDA kernel optimisation research

  • IterX: A reinforcement learning optimisation loop built specifically for code agents

Ornith-1.0 is the next step in a sustained RL research direction, not a one-off release. That context matters when evaluating how seriously to take the architecture claims behind it.

The Hidden Cost of Hand-Engineered Agent Harnesses

Before Ornith-1.0 can be properly evaluated, there's a foundational problem worth naming, one that affects every coding agent on the market today, open-source or proprietary.

Most coding agents don't just run on a model. They run on a harness, the orchestration wrapper that sits between the model and the task. A harness defines:

  • How memory is structured across a long task

  • When and in what order tools get called

  • How errors are caught and retried

  • How the agent replans when a step fails

  • How execution is sequenced across multi-file changes

Arra Oracle Alternative: Full MCP Memory Comparison
RecommendedยทAI Coding
Arra Oracle Alternative: Full MCP Memory Comparison

Looking for the best Arra Oracle alternative in 2026? Harmony MCP gives AI agents faster, more accurate, and token-efficient memory with deterministic context, token budgeting, landmark expansion, model-aware formatting, and production-ready MCP workflows.

Read article

Frameworks like OpenHands, Claude Code, and Cursor each ship with their own. Internal dev platforms build and maintain their own.

The problem is what happens after the harness ships.

Every new model integration requires harness updates. Every benchmark change exposes gaps in the orchestration logic. Every new task category, a different repo structure, a different testing framework, and a different failure pattern sends engineers back into the harness to patch it. This is invisible infrastructure debt. It doesn't show up in benchmark tables, but it compounds at scale across every team running agents in production.

There's also a runtime failure mode that rarely gets discussed. When an older coding agent produces a broken plan, it tends to repeat the same error on the next attempt rather than replan from a different angle. The harness has no mechanism to penalise that repetition, so the agent stalls, and a human steps in. The result is shorter autonomous run lengths and higher intervention rates than benchmark scores suggest.

How Ornith-1.0's Self-Improving Training Framework Works

Ornith-1.0's core innovation is a two-stage reinforcement learning loop that jointly optimises two things most models treat separately: the orchestration strategy and the solution itself.

The Two-Stage RL Loop

Each RL iteration runs in two sequential stages:

Stage 1: Scaffold Generation 

Conditioned on the task and the scaffold used in the previous iteration, the model proposes a refined execution strategy. That scaffold specifies:

  • How many reasoning steps to perform

  • Which tools to call and in what sequence

  • How memory should be organised across the task

  • When to retry, and how to structure the retry

  • How to handle intermediate failures and replan

Stage 2: Solution Rollout 

Using the scaffold it just generated, the model executes the actual coding task, reading repository files, editing code, running terminal commands, executing tests, and generating patches.

The reward from the rollout flows back to both stages simultaneously. Scaffolds that produced high-quality solutions survive into the next iteration. Weak orchestration patterns get replaced.

Three Layers of Reward Hacking Defence

Self-authored scaffolds create a specific attack surface on which the model could write orchestration logic that games the verifier by reading visible test files, hardcoding expected outputs, or locating an oracle solution sitting in the environment.

Ornith addresses this with three defence layers stacked in sequence:

  • Fixed trust boundary: The environment, tool surface, and test isolation are immutable. The model can only evolve its inner policy scaffold: memory layout, error-handling logic, orchestration sequencing. The verification layer is out of reach.

  • Deterministic monitor: Flags any attempt to read withheld file paths or modify verification scripts. Zero reward is assigned, and the trajectory is excluded from advantage computation entirely.

  • Frozen LLM judge: Sits on top of the verifier as a veto layer, catching intent-level gaming that stays within the technically allowed tool surface but violates the spirit of the task.

Each layer catches a different class of exploit. Together, they close the attack surface that self-scaffolding opens.

Asynchronous RL with Staleness-Weighted GRPO

Long agentic rollouts create a training problem that shorter tasks don't face. By the time a 30-step trajectory completes, the model weights have already updated, making the early tokens in that trajectory off-policy and potentially misleading as a training signal.

Get insights in your inbox!!

Weekly tips on building smarter apps. Join 8,200+ founders and builders.

No spam. Unsubscribe anytime. We respect your privacy.

Ornith handles this with pipeline-RL and a staleness-weighted loss:

  • Rollouts are processed asynchronously, so long trajectories don't block the training pipeline

  • Each token is weighted by how fresh it is relative to the current model state. Older off-policy tokens are downweighted

  • Tokens past a staleness threshold are dropped entirely rather than included at low weight

  • Token-level GRPO loss is clipped alongside the staleness weight, which prevents stale data from distorting gradient updates at 397B scale

The result is stable long-horizon training without sacrificing data efficiency a non-trivial engineering problem at the scale Ornith operates at.

Ornith-1.0 Benchmark Results: Numbers and Context

Ornith-1.0's benchmark performance is strong, but the numbers require context that most coverage doesn't provide. Here's what the scores show, what they don't, and what actually matters for teams making deployment decisions.

Performance Across Model Sizes

Model

Terminal-Bench 2.1

SWE-Bench Verified

Ornith-1.0-397B MoE

77.5

82.4

Ornith-1.0-35B MoE

64.2

โ€”

Ornith-1.0-31B Dense

โ€”

โ€”

Ornith-1.0-9B Dense

43.1

69.4

Claude Opus 4.8

85.0

87.6

Claude Opus 4.7

70.3

80.8

GLM-5.2-744B

81.0

โ€”

Qwen 3.5-397B

53.5

โ€”

DeepSeek-V4-Pro

โ€”

80.6

MiniMax M3

โ€”

โ€”

Key readings from the table:

  • The 397B flagship surpasses Claude Opus 4.7 on both benchmarks

  • The 35B MoE beats Qwen 3.5-397B (53.5) by over 10 points on Terminal-Bench at roughly one-eleventh the parameter count

  • The 9B Dense outperforms Gemma 4-31B, a model more than three times its size

  • Claude Opus 4.8 still leads the field at 85.0 and 87.6, respectively

  • GLM-5.2-744B leads Terminal-Bench 2.1 at 81.0 above Ornith's 77.5

The "state-of-the-art" designation in DeepReinforce's release applies specifically to open-source models of comparable parameter count. It is not a global leaderboard claim, and reading it as one misrepresents what the numbers actually show.

SWE-Bench Verified Limitations: Contamination, Leakage, and What Teams Should Actually Measure

Before SWE-Bench Verified scores are used to make any procurement or deployment decision, there are three findings from 2026 that every engineering team should know:

OpenAI stopped reporting SWE-Bench Verified in February 2026: An internal audit of 138 hard problems found that more than 59.4% had flawed test cases that rejected functionally correct solutions. The audit also found that frontier models could reproduce gold-patch solutions verbatim from just the task ID, a direct fingerprint of training-data contamination.

You Might Also LikeยทAI Coding
Best Graphify Alternative for Token-Efficient AI Agents

AI agents need more than a project graph to work accurately at scale. Harmony MCP is the Graphify alternative for teams that need faster context retrieval, token budgeting, factual memory, and production-ready MCP workflows.

Continue reading

Solution leakage affects more benchmark instances: Independent research found that models recall correct file paths from training data up to 76% of the time using only the issue description, no code context, and no repository structure required. When recall drives resolution on a third of tasks, the score is measuring memory as much as reasoning.

A 2026 study introducing the SWE-ABS evaluation framework: This study found that 19.71% of patches previously marked as "resolved" by top coding agents were actually semantically incorrect when evaluated with strengthened test suites, highlighting weaknesses in benchmark verification.

The harder, less contaminated alternative is SWE-Bench Pro. Ornith-1.0-397B scores 62.2 there; Claude Opus 4.8 leads at 69.2. That gap evaluated on a benchmark with fewer leakage vectors is a more honest signal for teams comparing models in production contexts.

Cross-Harness Consistency: The Signal That Actually Generalises

One data point buried in DeepReinforce's release notes:

Ornith-1.0-397B scores 77.5 on Terminal-Bench 2.1 under Terminus-2 and 78.2 under Claude Code 2.1.126, two different agent runtimes, two different harnesses, near-identical results.

That consistency is significant. A model whose scores shift sharply across harnesses is demonstrating benchmark-specific optimisation, not genuine capability. Ornith's cross-harness stability is the practical evidence that self-scaffolding generalises the learned orchestration strategies, transferring across agent runtimes rather than collapsing when the evaluation environment changes.

For production teams, this is a more useful signal than any single headline number.

Choosing the Right Ornith-1.0 Model for Your Stack

Not every team needs a 397B cluster. The Ornith family was designed to cover a range of deployment environments, and the right choice depends on infrastructure constraints, latency requirements, and task complexity. Here's how to think through it.

Deployment Scenario

Model

VRAM Required

Why It Fits

Local/offline / laptop coding

9B Dense (Q4)

~6 GB

Runs on consumer hardware; zero API cost ceiling

Single-GPU dev server

35B MoE (Q5_K_M)

~25 GB

Higher benchmark scores at lower per-token compute than the 9B

Mid-scale internal pipeline

31B Dense (bf16)

~62 GB

Solid all-rounder for constrained server environments without MoE infrastructure

Production agent pipeline

397B MoE (FP8)

~200 GB

Flagship performance; best fit for high-throughput enterprise pipelines

The counter-intuitive pick: why the 35B MoE outperforms the 9B at inference speed

This is the detail most coverage skips entirely. The 35B MoE is not slower than the 9B Dense just because it has more parameters; it's actually faster at inference. The reason is sparse activation: the MoE architecture only activates approximately 3B parameters per token at runtime, regardless of the total parameter count. 

TrendingAI Coding
Harmony MCP AI Agent Memory: Cutting AI Coding Costs

Harmony is an MCP-based AI coding agent memory layer that helps tools like Claude Code, Cursor, and Windsurf understand codebase context faster. Learn how Harmony reduces repeated file discovery, cuts token waste, improves coding-agent accuracy, and helps developers get more value from their AI coding workflows.

18 min readRead more

The 9B Dense activates all 9B parameters on every token. In practice, the 35B MoE delivers higher benchmark scores at lower per-token compute cost than the 9B, making it the practical sweet spot for most teams running on a single high-end GPU.

Serving options across the stack:

  • vLLM / SGLang: Recommended for production; OpenAI-compatible endpoint out of the box

  • Transformers: Requires a reasoning parser enabled to surface the reasoning_content field correctly

  • Ollama / Unsloth: For GGUF variants; straightforward local deployment with no extra configuration

One deployment consideration worth flagging:

The 397B MoE in FP8 requires approximately 200GB of GPU memory, typically an 8ร— 80GB GPU configuration. For teams without that infrastructure, the 35B MoE closes most of the performance gap at a fraction of the hardware cost. The jump from 35B to 397B is meaningful on Terminal-Bench (64.2 โ†’ 77.5) but may not justify the infrastructure overhead depending on task category and throughput requirements.

CONCLUSION

Ornith-1.0 is not just a benchmark milestone. It's evidence that the next competitive axis in agentic coding is orchestration intelligence and that the infrastructure layer sitting above the model is becoming more important, not less, as models get smarter about managing themselves.

The teams positioned to benefit are those building delivery platforms that can route, monitor, and govern autonomous coding pipelines at scale, not just swap in the latest model when a new leaderboard drops.

That's exactly the problem CodeConductor is built to solve.

Ready to build AI delivery infrastructure that keeps pace with where agentic coding is headed? Get started with CodeConductor today!

Ready to Build Without Code?
See how CodeConductor helps enterprises ship faster while staying compliant.
Get Started Now

FAQs

What is Ornith-1.0? 

Ornith-1.0 is an open-source agentic coding model family by DeepReinforce, released June 25, 2026. It spans four sizes, 9B to 397B, and is the first coding model trained to generate its own task scaffolds during reinforcement learning rather than relying on human-designed harnesses.

How does Ornith-1.0 compare to Claude Opus 4.7 and 4.8? 

The 397B variant surpasses Claude Opus 4.7 on both Terminal-Bench 2.1 (77.5 vs. 70.3) and SWE-Bench Verified (82.4 vs. 80.8). Claude Opus 4.8 still leads the field at 85.0 and 87.6, respectively.

Is Ornith-1.0 truly open source? 

Yes, MIT licensed, globally accessible, no regional restrictions. Weights are available on Hugging Face in FP8, GGUF, and bf16 formats across all four model sizes.

Which Ornith-1.0 model size is best for most teams? 

The 35B MoE is the practical sweet spot; it's faster than the 9B Dense at inference due to sparse activation, requires only ~25GB VRAM, and significantly outperforms models several times its parameter size on benchmark tasks.

Does Ornith-1.0 work with existing agent frameworks? 

Yes. All variants are confirmed compatible with OpenHands, Claude Code 2.1, OpenClaw, and Hermes Agent out of the box. Every model exposes an OpenAI-compatible endpoint, so existing agent loops require no code changes.

Key Takeaways

4 essential insights

Prioritize self-scaffolding models to reduce fragile, hand-engineered agent harnesses.
Evaluate agentic coding by orchestration robustness, not just benchmark scores.
Choose Ornith variants to match deployment scale and inference efficiency needs.
Plan infrastructure for tool-call compatibility and long-horizon, repo-level task loops.
Paul Dhaliwal
Written by
Paul Dhaliwal
Founder & Chief Executive Officer

Paul Dhaliwal is a tech innovator and Founder of CodeConductor, an open-source no/low-code platform. With 10+ years of experience in AI and scalable development, Paul focuses on crafting intelligent solutions that drive real-world value. A firm believer in the mantra "Eat, Sleep, Code, Repeat," he balances his passion for software with a love for travel and family.

โšก

Build your app

No coding. No designers. Just describe what you want and watch AI build it.

Try CodeConductor Explore the Platform
15 min left
More to Explore

Keep Reading

Arra Oracle Alternative: Full MCP Memory Comparison
AI Coding
Arra Oracle Alternative: Full MCP Memory Comparison
Jun 24, 202614 min read
Best Graphify Alternative for Token-Efficient AI Agents
AI Coding
Best Graphify Alternative for Token-Efficient AI Agents
Jun 22, 202614 min read
Harmony MCP AI Agent Memory: Cutting AI Coding Costs
AI Coding
Harmony MCP AI Agent Memory: Cutting AI Coding Costs
Jun 19, 202618 min read