May 16, 2026

Orthrus-Qwen3 Delivers Up to 7.8× Tokens Per Forward Pass on Qwen3

Orthrus adapts its dual-sequence batching architecture to Qwen3, achieving up to 7.8× more tokens per forward pass while preserving an identical output distribution.

Orthrus-Qwen3 brings the Orthrus speculative decoding approach to Qwen3 models, reporting up to 7.8× as many tokens per forward pass without altering the output distribution. The identical-distribution claim matters: if it holds, it rules out the quality tradeoffs that typically accompany aggressive inference optimization.

The core idea behind Orthrus is batching two sequences into a single forward pass using a modified attention mask. Each token attends only to the appropriate context for its sequence, so the model processes both in parallel without contamination. The result is better hardware utilization on the same weights, with no retraining required.
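The masking idea can be illustrated with a minimal sketch. This is a generic block-diagonal causal mask for two sequences packed into one pass, not the Orthrus implementation. The article does not specify Orthrus's actual mask layout, and the function names here are hypothetical.

```python
def packed_causal_mask(len_a: int, len_b: int) -> list[list[bool]]:
    """Build a mask where mask[i][j] is True if query i may attend to key j.

    Assumption: two sequences are concatenated as [A tokens..., B tokens...].
    A position may attend only to earlier-or-equal positions in its own sequence.
    """
    # Sequence id for each packed position: 0 for sequence A, 1 for sequence B.
    seq_id = [0] * len_a + [1] * len_b
    total = len_a + len_b
    mask = []
    for i in range(total):
        row = []
        for j in range(total):
            same_sequence = seq_id[i] == seq_id[j]  # blocks cross-sequence attention
            causal = j <= i                         # blocks attention to future tokens
            row.append(same_sequence and causal)
        mask.append(row)
    return mask


def render(mask: list[list[bool]]) -> str:
    """Render the mask as text: 'x' means allowed, '.' means masked out."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


print(render(packed_causal_mask(3, 2)))
```

For a 3-token and a 2-token sequence, this prints two independent lower-triangular blocks, which is what keeps the two sequences from contaminating each other:

```
x....
xx...
xxx..
...x.
...xx
```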

Qwen3 is a meaningful target for this approach. The family spans a wide capability range, from sub-2B edge models up to larger dense variants, and has seen rapid adoption among builders who want strong multilingual reasoning without routing to a closed API. If the reported gains carry over, those deployments can extract substantially more throughput per GPU, which directly lowers cost per token at inference time.

The identical output distribution guarantee is the load-bearing claim here. If verified, it means engineers can swap Orthrus-Qwen3 in behind existing Qwen3 serving stacks without behavioral regression: no changes in prompt sensitivity and no distribution shift to audit downstream. That lowers integration risk considerably.

For solo founders running self-hosted inference, the arithmetic is straightforward: the same hardware budget covers more requests. For teams operating at scale, the multiplier compresses GPU-hours meaningfully, particularly on workloads that are decode-bound.
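A rough version of that arithmetic is sketched below. Every number in it is a hypothetical placeholder, not a measurement from the release. It also assumes that a pass emitting more tokens may take longer than a baseline pass, so the tokens-per-pass multiplier is divided by a latency ratio before it is treated as a throughput gain.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Serving cost in currency units per one million generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


def effective_speedup(tokens_per_pass_multiplier: float, pass_latency_ratio: float) -> float:
    """Wall-clock throughput gain.

    pass_latency_ratio is (accelerated pass time) / (baseline pass time).
    A ratio of 1.0 means the extra tokens per pass come for free.
    """
    return tokens_per_pass_multiplier / pass_latency_ratio


# Hypothetical inputs: replace with your own measurements.
gpu_cost_per_hour = 2.00       # assumed hourly GPU price
baseline_tps = 50.0            # assumed baseline decode throughput, tokens/second
multiplier = 7.8               # upper bound quoted in the article
latency_ratio = 1.3            # assumed per-pass slowdown, purely illustrative

speedup = effective_speedup(multiplier, latency_ratio)
before = cost_per_million_tokens(gpu_cost_per_hour, baseline_tps)
after = cost_per_million_tokens(gpu_cost_per_hour, baseline_tps * speedup)
print(f"speedup {speedup:.2f}x, cost per 1M tokens {before:.2f} -> {after:.2f}")
```

The point of separating the two functions is that tokens per forward pass and tokens per second are different quantities; only the second one moves the bill.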

The release is available on GitHub. Engineers evaluating it should benchmark against their own sequence length distributions, since throughput multipliers in speculative and batched decoding approaches vary with request shape. The 7.8× figure is the upper bound reported by the team, not a typical or guaranteed result.
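One way to organize such a benchmark is to bucket requests by prompt length and compute tokens per forward pass within each bucket. The sketch below assumes you can log, per request, the prompt length, the number of tokens generated, and the number of forward passes used. The record format and function name are hypothetical and not part of the release.

```python
from bisect import bisect_right
from collections import defaultdict


def tokens_per_pass_by_bucket(records, edges):
    """Aggregate tokens per forward pass by prompt-length bucket.

    records: iterable of (prompt_len, tokens_generated, forward_passes) tuples.
    edges:   sorted bucket boundaries; bucket k holds prompt lengths in
             [edges[k-1], edges[k]), with open-ended first and last buckets.
    Returns a dict mapping bucket index to tokens / forward passes.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket -> [tokens, passes]
    for prompt_len, tokens, passes in records:
        bucket = bisect_right(edges, prompt_len)
        totals[bucket][0] += tokens
        totals[bucket][1] += passes
    return {b: t / p for b, (t, p) in sorted(totals.items()) if p > 0}


# Hypothetical log entries, for illustration only.
sample = [(100, 60, 10), (200, 40, 10), (1000, 30, 10)]
for bucket, ratio in tokens_per_pass_by_bucket(sample, edges=[512, 2048]).items():
    print(f"bucket {bucket}: {ratio:.1f} tokens per forward pass")
```

Comparing these per-bucket ratios between a baseline run and an accelerated run shows where on your own traffic the multiplier lands, rather than relying on the headline figure.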