May 13, 2026

Needle Distills Gemini Tool Calling into a 26M Parameter Model

Cactus Compute released Needle, a 26M parameter model distilled from Gemini specifically for tool calling, targeting on-device and edge inference workloads where full-scale models are impractical.

Needle is trained for one job: tool calling, with behavior distilled from Gemini. Cactus Compute publishes the project on GitHub as cactus-compute/needle.

The case for a small, task-specialized model here is straightforward. Tool calling is a structured output problem: the model needs to select the right function and emit well-formed JSON arguments. It does not require broad world knowledge or long-context reasoning. A 26M model can plausibly learn this mapping if the training signal is clean, and distillation from a capable teacher like Gemini is a direct way to get that signal.
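
To make "the right function and well-formed JSON arguments" concrete, here is a minimal sketch of a checker for a model-emitted tool call. The `{"name": ..., "arguments": ...}` payload shape, the `TOOLS` registry, and both tool definitions are assumptions for illustration; the source does not describe Needle's actual output format.

```python
import json

# Hypothetical tool registry: names, fields, and types are illustrative only.
TOOLS = {
    "get_weather": {"required": {"city": str}, "optional": {"unit": str}},
    "set_timer": {"required": {"seconds": int}, "optional": {}},
}


def check_tool_call(raw: str, tools: dict = TOOLS) -> list[str]:
    """Return a list of problems with a tool call; an empty list means valid."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["top-level value is not an object"]

    # Step 1: did the model select a function that exists?
    name = call.get("name")
    if not isinstance(name, str) or name not in tools:
        return [f"unknown tool: {name!r}"]

    # Step 2: are the arguments complete and correctly typed?
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return ["arguments is not an object"]

    spec = tools[name]
    allowed = {**spec["required"], **spec["optional"]}
    problems = []
    for field in spec["required"]:
        if field not in args:
            problems.append(f"missing required argument: {field}")
    for field, value in args.items():
        if field not in allowed:
            problems.append(f"unexpected argument: {field}")
        elif type(value) is not allowed[field]:
            expected = allowed[field]
            problems.append(f"wrong type for {field}: expected {expected.__name__}")
    return problems
```

A call such as `check_tool_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}')` returns an empty list, while a payload with a string where `seconds` should be an integer returns one problem. A production system would more likely validate against full JSON Schema definitions; the flat type map here just keeps the sketch short.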

For engineers building agentic systems on constrained hardware (mobile, embedded, edge servers, or browser-side runtimes), full-scale models are often impractical. Latency, memory, and cost all push against deploying a 7B or 70B model just to route a function call. Needle targets that gap directly.

The practical implications extend beyond edge deployment. In server-side architectures with high call volume, routing tool-use steps through a 26M model rather than a frontier API can reduce both cost and latency for the coordination layer, reserving larger models for the reasoning steps that actually require them. This fits a pattern of decomposing agentic pipelines by capability requirement rather than running every step through the same model.
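
The decomposition described above can be sketched as a small dispatcher. This is a minimal illustration, not anything from the Needle repository: the `Step` type, the `"tool_call"` and `"reasoning"` labels, and the two stub models are all hypothetical stand-ins for whatever small model and frontier API a pipeline actually uses.

```python
from dataclasses import dataclass
from typing import Callable

# A model is anything that maps a prompt string to an output string.
ModelFn = Callable[[str], str]


@dataclass
class Step:
    kind: str    # hypothetical labels: "tool_call" or "reasoning"
    prompt: str


def make_router(small_model: ModelFn, large_model: ModelFn) -> Callable[[Step], str]:
    """Build a dispatcher that sends tool-call steps to the small model."""

    def route(step: Step) -> str:
        if step.kind == "tool_call":
            return small_model(step.prompt)
        # Everything else is treated as needing the larger model.
        return large_model(step.prompt)

    return route


# Stub models so the sketch runs without any inference backend.
def small_stub(prompt: str) -> str:
    return f"small:{prompt}"


def large_stub(prompt: str) -> str:
    return f"large:{prompt}"


route = make_router(small_stub, large_stub)
```

The sketch assumes each step already carries a label; in practice, deciding which steps count as tool calls is its own design problem, and a pipeline may want to escalate to the larger model when the small model's output fails validation.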

Distillation quality is the key variable. If Needle reliably produces valid tool call payloads across a range of schemas, it becomes a usable drop-in for orchestration layers. The team publishes the model and code on GitHub, which lets engineers evaluate it directly against their own tool schemas before committing.
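
One way to run that evaluation is a small harness that feeds prompts to the model and counts how many outputs parse as JSON and how many match the expected call. The sketch below is an assumption-laden illustration: `model_fn`, the case format, and the canned outputs are hypothetical, and a stub stands in for Needle because the source does not describe its inference API.

```python
import json
from typing import Callable


def evaluate(model_fn: Callable[[str], str], cases: list[dict]) -> dict:
    """Count outputs that parse as JSON and outputs that equal the expected call."""
    parsed = 0
    matched = 0
    for case in cases:
        raw = model_fn(case["prompt"])
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output counts against both metrics
        parsed += 1
        if call == case["expected"]:
            matched += 1
    return {"total": len(cases), "parsed": parsed, "matched": matched}


# Canned outputs standing in for a real model (hypothetical data).
CANNED = {
    "weather in Oslo": '{"name": "get_weather", "arguments": {"city": "Oslo"}}',
    "timer for 10s": '{"name": "set_timer", "arguments": {"seconds": 10',  # truncated
    "weather in Lima": '{"name": "get_weather", "arguments": {"city": "Rome"}}',
}

CASES = [
    {"prompt": "weather in Oslo",
     "expected": {"name": "get_weather", "arguments": {"city": "Oslo"}}},
    {"prompt": "timer for 10s",
     "expected": {"name": "set_timer", "arguments": {"seconds": 10}}},
    {"prompt": "weather in Lima",
     "expected": {"name": "get_weather", "arguments": {"city": "Lima"}}},
]

report = evaluate(CANNED.__getitem__, CASES)
```

Exact-match comparison is deliberately strict; teams with free-text arguments may prefer a looser comparison per field. Keeping parse counts separate from match counts shows whether failures come from malformed JSON or from wrong tool or argument choices.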

At 26M parameters, the model is also small enough that, once quantized, it can run on consumer hardware without dedicated inference infrastructure. That broadens the deployment surface considerably for solo founders building lightweight agents.
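
A back-of-envelope calculation shows why the size matters. The sketch below estimates weight storage only, assuming exactly 26 million parameters and uniform precision; it ignores activations, runtime overhead, and quantization metadata, and the listed formats are common examples rather than formats the source confirms Needle ships in.

```python
PARAMS = 26_000_000  # parameter count from the announcement, assumed exact

# Bits per weight for some common numeric formats (illustrative list).
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def weight_megabytes(params: int, bits: int) -> float:
    """Approximate weight storage in megabytes (1 MB = 1,000,000 bytes)."""
    return params * bits / 8 / 1_000_000


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_megabytes(PARAMS, bits):.0f} MB")
# prints:
# fp32: 104 MB
# fp16: 52 MB
# int8: 26 MB
# int4: 13 MB
```

Even at full 32-bit precision the weights come to roughly 104 MB, so quantization here is less about making the model fit at all and more about trimming download size and memory on phones and in browsers.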