Ornith-1.0: self-improving open-source models for agentic coding
Ornith-1.0 delivers self-improving open-source models for agentic coding tasks Ornith-1.0 is an open-source suite of self-improving models designed for agentic coding, released under the deepreinforce-ai project. The suite ships as three model checkpoints — a dense 9B model and two Mixture-of-Experts models at 35B and 397B parameters — all sharing the same OpenAI-compatible interface.

Ornith-1.0 delivers self-improving open-source models for agentic coding tasks
Ornith-1.0 is an open-source suite of self-improving models designed for agentic coding, released under the deepreinforce-ai project. The suite ships as three model checkpoints — a dense 9B model and two Mixture-of-Experts models at 35B and 397B parameters — all sharing the same OpenAI-compatible interface.
Ornith-1.0 delivers self-improving open-source models for agentic coding tasks
Ornith-1.0 is an open-source suite of self-improving models designed for agentic coding, available via the deepreinforce-ai project on GitHub. The release comprises three model checkpoints: a dense 9B model and two Mixture-of-Experts (MoE) variants at 35B and 397B parameters respectively.
All three checkpoints expose the same OpenAI-compatible interface and support a 256K context window (262,144 tokens). The dense 9B model is designed to fit on a single 80GB GPU, while the larger MoE checkpoints are sharded across multi-GPU nodes using tensor parallelism.
Ornith-1.0 is a reasoning model, meaning the assistant turn opens with a <think> … </think> block before producing a final answer. The chain-of-thought is returned in a separate reasoning_content field, and the model's tool-call blocks are surfaced as OpenAI-style tool_calls.
How It Works
Each model in the Ornith-1.0 suite is evaluated against size-appropriate baselines using the same harnesses and decoding setup across all three sizes. Benchmarks used include Terminal-Bench 2.1, SWE-bench Verified, SWE-bench Pro, SWE-bench Multilingual, SWE Atlas, NL2Repo, and ClawEval — an agentic code benchmark built over real-user task distributions.
Evaluation configurations vary by benchmark. Terminal-Bench 2.1 runs use a 4-hour timeout with 32 CPU cores and 48GB RAM, averaged over 5 runs. SWE-bench evaluations use the OpenHands harness with a 256K context window, while NL2Repo uses a 400K context window with a 48K output limit and anti-hacking filters.
The recommended sampling parameters for serving are temperature=0.6, top_p=0.95, and top_k=20, though temperature=1.0 is used to reproduce the reported benchmark results. Each checkpoint is published in multiple precision and format variants.
Getting Started
Ornith-1.0 requires recent runtimes to serve, including transformers >= 5.8.1. The project supports serving via vLLM or SGLang, both of which can be configured to stand up an OpenAI-compatible server. The dense 9B checkpoint is described as the easiest option for local testing.
Once a server is running, users can interact with it using any OpenAI-compatible client, including Python and Node.js SDKs or curl pointed at the standard /v1/chat/completions endpoint. Streaming tokens and tool-calling are both supported out of the box.
Story based on discussion on Hacker News.
Enjoyed this tech story? Share it with others!


