May 16, 2026
Orthrus Cuts Qwen3 Forward Passes While Preserving Output Distribution
Orthrus applies speculative decoding-style draft-verification to Qwen3, processing more tokens per forward pass without changing the model's output distribution.
Orthrus is a technique that increases token throughput on Qwen3 by batching multiple draft tokens into a single forward pass, then verifying them in parallel. The project claims up to 7.8× as many tokens per forward pass as standard autoregressive decoding, with the output distribution kept identical to the base model's.
The core idea borrows speculative decoding's draft-then-verify pattern but applies it within a single model rather than requiring a separate, smaller draft model. That matters for deployment: you get throughput gains without managing two sets of model weights, two sets of KV caches, or two inference processes.
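To make the draft-then-verify pattern concrete, here is a minimal sketch of the acceptance rule from standard speculative sampling, run on toy probability lists rather than a real model. It is not taken from the Orthrus codebase, and how Orthrus produces and checks its drafts may differ. The names `verify_draft` and `first_token_frequencies` are hypothetical, chosen only for illustration.

```python
import random


def verify_draft(draft_tokens, q_dists, p_dists, rng):
    """Accept or reject drafted tokens with the standard speculative-sampling rule.

    Assumption: this is the textbook rule, not Orthrus's actual implementation.
    draft_tokens[i] was sampled from q_dists[i] (the draft distribution).
    p_dists[i] is the target distribution at the same position; p_dists has
    one extra entry, used for a bonus token when every draft is accepted.
    Returns the tokens emitted by this single verification pass.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # Accept the draft with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q), then stop,
        # because later drafts were conditioned on the rejected token.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        emitted.append(rng.choices(range(len(p)), weights=residual)[0])
        return emitted
    # Every draft accepted: one extra token comes straight from the target.
    bonus = p_dists[len(draft_tokens)]
    emitted.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return emitted


def first_token_frequencies(p, q, trials, seed=0):
    """Empirical distribution of the first emitted token over many trials."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(trials):
        draft = rng.choices(range(len(q)), weights=q)[0]
        counts[verify_draft([draft], [q], [p, p], rng)[0]] += 1
    return [c / trials for c in counts]


if __name__ == "__main__":
    target = [0.7, 0.2, 0.1]  # toy target distribution (illustrative values)
    draft = [0.2, 0.5, 0.3]   # deliberately poor draft distribution
    print(first_token_frequencies(target, draft, trials=20000))
```

Even with a badly mismatched draft distribution, the emitted frequencies should track the target distribution. That is the sense in which this family of methods can be lossless: a poor draft lowers the acceptance rate, not the output quality.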
For engineers running Qwen3 inference at scale, fewer forward passes per sequence means lower latency at the same hardware budget, or the same latency at reduced compute. The gains are most pronounced on longer sequences where verification acceptance rates stay high. On short completions the overhead of batching drafts can narrow the gap.
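The dependence on acceptance rate can be made concrete with the usual back-of-the-envelope model from the speculative decoding literature. The sketch below assumes each drafted token is accepted independently with the same probability `alpha`, which is a simplification and not a property the Orthrus project claims; `expected_tokens_per_pass` is a hypothetical helper name.

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens emitted per verification pass.

    Assumption: each of the k drafted tokens is accepted independently with
    probability alpha, and every pass emits one extra token (the resampled
    or bonus token). The closed form is the geometric sum
    1 + alpha + ... + alpha**k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Illustrative draft length and acceptance rates, not measured values.
    for alpha in (0.5, 0.8, 0.95):
        print(alpha, round(expected_tokens_per_pass(alpha, k=8), 2))
```

Under this model the payoff is steeply nonlinear in `alpha`: a draft length of 8 yields close to 2 tokens per pass at 50% acceptance but most of the 9-token ceiling at 95%, which is why workloads with lower acceptance see a narrower gap.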
The release targets Qwen3 specifically, which is relevant given Qwen3's strong benchmark positioning and its growing use as a base for fine-tuned production models. A technique that preserves the output distribution exactly also preserves whatever fine-tuning or RLHF has shaped in that distribution, so you are not trading alignment properties for speed.
The team published the implementation on GitHub under the Orthrus project. The codebase appears designed for direct integration rather than as a standalone inference server, so the practical path is patching it into an existing serving stack. Compatibility with popular inference frameworks like vLLM or SGLang is worth verifying before committing to an integration.
The 7.8× figure represents the upper bound on tokens processed per forward pass, not wall-clock speedup. Actual latency reduction depends on hardware memory bandwidth, batch size, and acceptance rate under your specific workload. Benchmark on your data before deploying.
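One rough way to translate tokens per forward pass into a latency estimate is to divide by how much slower a multi-token pass is than a single-token pass on your hardware. The sketch below is simple arithmetic under that assumption; `wall_clock_speedup` and `pass_cost_ratio` are hypothetical names, and the 1.5 cost ratio is a made-up placeholder, not a measurement.

```python
def wall_clock_speedup(tokens_per_pass, pass_cost_ratio):
    """Estimate decoding speedup over one-token-per-pass autoregression.

    tokens_per_pass: average tokens emitted per forward pass (baseline is 1).
    pass_cost_ratio: time of one multi-token pass divided by the time of one
        baseline single-token pass, measured on your own hardware.
    """
    if tokens_per_pass <= 0 or pass_cost_ratio <= 0:
        raise ValueError("both arguments must be positive")
    return tokens_per_pass / pass_cost_ratio


if __name__ == "__main__":
    # 7.8 is the project's claimed upper bound; 1.5 is a placeholder ratio.
    print(wall_clock_speedup(7.8, 1.5))
```

With those placeholder numbers the estimate is roughly 5.2× rather than 7.8×, and an average acceptance rate below the best case would lower it further. Measuring `pass_cost_ratio` at your real batch size is the part that matters, since a pass that is memory-bandwidth-bound at batch size 1 can become compute-bound as batches grow.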
Source
news.ycombinator.com