Local LLM Task Routing Decision: The Right Tool (Part 5)

The final routing decision for local LLM deployment in a network lab. Matching specific AI models to voice loops, background tasks, and critical operations.

The Routing Decision


This is the close. Four runs, six local models, one cloud API reference. The question was always the same: which model belongs where in a system that actually has to work?

Here is the answer.

What the Series Was

A field study, not a leaderboard. The goal was never to rank models. It was to make a defensible routing decision for a lab domain intelligence system running on a Dell Precision 3571 with an NVIDIA T600 and 32GB of RAM. Every task was drawn from real lab operations. Every finding was something that changed the architecture.

Parts 1 and 2 established the baseline and corrected the test design. Part 3 characterized the GPU. Part 4 changed the evaluation axis entirely. Each part produced one or two findings that revised what I thought I knew at the start.

The benchmark is done. The routing table is what comes next.


The Routing Table

| Task | Model | Lane | Rationale |
|---|---|---|---|
| Containerlab topology generation | Qwen 2.5 Coder 7B | Background | Best T3 pass rate, 14 tok/s, coder specialization earns its slot |
| gNMI scripting / Python automation | Qwen 2.5 Coder 7B | Background | T5 pass, correct path syntax, generator pattern |
| Session log summarization | Gemma 3 12B | Background | 24/32 prose, 7 tok/s, best narrative reasoning |
| BGP incident narrative | Gemma 3 12B | Background | W2: only local model with correct scope and timeline |
| Status brief / structured output | Gemma 3 4B | Voice / Background | Fast enough for voice generation, acceptable accuracy on format |
| Voice response (real-time) | Gemma 3 4B | Voice | 27-28 tok/s GPU, 5-6s for 150 tokens. Borderline viable |

Three models do not appear in this table. That is intentional.


What Got Eliminated

Llama 3.1 8B came into the series as the incumbent general-purpose 8B. It exits without a slot.

On code tasks (Parts 1-3), it performed at the pack average. Not bad, not distinguishing. On prose tasks (Part 4), it failed where it mattered: it calculated 23 minutes for a 66-second reconvergence event, dropped the srl2 prefix-limit subplot entirely from a scope paragraph, and used a WARNING bullet where the spec required ACTION. These aren't edge cases. They are the kinds of errors that make a summary misleading rather than useful.

Llama 3.1 8B stays available in Ollama. It is not being removed. But it doesn't route to anything in the autonomous system.

Mistral Nemo 12B was the new entrant in Part 4. It tied Qwen 2.5 Coder 7B at 23/32. That score isn't bad. The problem is that Gemma 3 12B scored 24/32 at roughly the same throughput (6-7 tok/s), and Gemma 3 12B was already in the lineup. Mistral Nemo is a competent 12B model that arrived in a field where every slot was already claimed by a model with a specific advantage.


The Sovereignty Mandate (Severing the Tether)

You will notice Claude Haiku 4.5 is absent from the routing table.

Haiku won the benchmark. It scored 29/32 on prose. It generated a 1,400-token topology in 6 seconds. It suffered no VRAM constraints and no thermal throttling. It was the most capable model in the test.

And it is entirely disqualified for production.

The goal of this architecture is complete sovereignty. An autonomous network lab that requires a cloud API key to parse its own BGP events is not autonomous. It is tethered. If the internet goes down, or the API endpoint changes, or the subscription lapses, the system goes blind.

Haiku was brought into this series for one reason: to serve as the calibration weight. It showed me what the ceiling of current capability looks like. It proved that my token budgets were flawed in Part 1. It gave me a standard against which to measure the local models.

But the measuring stick is not the tool.

If I route high-stakes tasks to the cloud, I compromise the environment. Sovereignty means accepting the limitations of your own metal. If Gemma 3 12B hallucinates a detail in a narrative, I must catch it. If Qwen 2.5 Coder 7B generates a flawed topology, the automated deployment checks must reject it. The system must rely on local code and local trust monitors, not external intelligence. You own the hardware, and you own the mistakes.

Haiku is evicted from the production loop. The lab runs on what it physically holds.
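As an illustration of what a local deployment check could look like, here is a minimal sketch of a structural gate for a generated Containerlab topology. It is not the lab's actual check. The function name `validate_topology` is hypothetical, the input is assumed to be the topology already parsed into a Python dict, and only the short `endpoints: ["node:iface", "node:iface"]` link form is covered. Special endpoints such as host or management links would be flagged as unknown nodes by this sketch.

```python
def validate_topology(topo: dict) -> list[str]:
    """Return a list of structural errors; an empty list means the gate passes.

    Assumes the short Containerlab layout: name, topology.nodes, topology.links.
    """
    errors = []
    if not isinstance(topo.get("name"), str) or not topo["name"]:
        errors.append("missing or empty 'name'")

    topology = topo.get("topology")
    if not isinstance(topology, dict):
        return errors + ["missing 'topology' mapping"]

    nodes = topology.get("nodes")
    if not isinstance(nodes, dict) or not nodes:
        errors.append("'topology.nodes' must be a non-empty mapping")
        nodes = {}

    for i, link in enumerate(topology.get("links") or []):
        endpoints = link.get("endpoints") if isinstance(link, dict) else None
        if not isinstance(endpoints, list) or len(endpoints) != 2:
            errors.append(f"link {i}: expected exactly two endpoints")
            continue
        for ep in endpoints:
            # Each endpoint must look like "node:interface" and name a declared node.
            node, sep, iface = str(ep).partition(":")
            if not sep or not iface:
                errors.append(f"link {i}: endpoint {ep!r} is not 'node:interface'")
            elif node not in nodes:
                errors.append(f"link {i}: endpoint {ep!r} references unknown node {node!r}")
    return errors


if __name__ == "__main__":
    generated = {
        "name": "lab",
        "topology": {
            "nodes": {"srl1": {"kind": "nokia_srlinux"}},
            "links": [{"endpoints": ["srl1:e1-1", "srl2:e1-1"]}],
        },
    }
    for problem in validate_topology(generated):
        print("REJECT:", problem)
```

A gate like this catches only structural mistakes, such as a link pointing at a node the model never declared. It says nothing about whether the topology is the one that was asked for.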


The Voice Lane Is Narrow

The T600's 4GB VRAM creates a hard constraint. Models at Q4 quantization that fit inside 3.2GB active VRAM (roughly 4B parameters) run at 27-28 tok/s. At 150 output tokens, that is 5-6 seconds. For a spoken response, that is borderline acceptable. For every other use case, it is fine.

Models above that parameter count, even at Q4, either spill or fill the VRAM envelope and run at 6-14 tok/s. At 150 output tokens, that is 11-25 seconds. Not voice viable.

The 4GB VRAM line is not arbitrary. It is physical. It comes from the hardware, not from a benchmark design choice. The routing consequence: anything that needs to arrive in real-time goes to Gemma 3 4B.
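The latency figures above are plain division: output tokens over tokens per second. A small sketch makes the arithmetic checkable. The function names and the 6-second voice budget are assumptions for illustration, not measured thresholds, and the calculation covers generation time only. It ignores prompt processing, model load, and speech synthesis.

```python
def response_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Generation time only: output tokens divided by sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


def is_voice_viable(output_tokens: int, tokens_per_second: float,
                    budget_seconds: float = 6.0) -> bool:
    """True when generation fits the budget. The 6.0s default is an assumed cutoff."""
    return response_seconds(output_tokens, tokens_per_second) <= budget_seconds


if __name__ == "__main__":
    # Throughput figures are the ones quoted in this section.
    for label, rate in [("4B low", 27.0), ("4B high", 28.0),
                        ("7B+ high", 14.0), ("7B+ low", 6.0)]:
        seconds = response_seconds(150, rate)
        print(f"{label:9s} {rate:4.0f} tok/s -> {seconds:5.1f} s, "
              f"voice viable: {is_voice_viable(150, rate)}")
```

At 150 tokens, 27 tok/s works out to about 5.6 seconds and 6 tok/s to 25 seconds, which matches the 5-6 second and 11-25 second ranges quoted above.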


The Architecture That Comes Out of This

The inference layer after this series:

TASK ARRIVES
    │
    ├── Is it a real-time voice response? ─────────────────────────────────→ Gemma 3 4B
    │
    ├── Is it a background code task? ─────────────────────────────────────→ Qwen 2.5 Coder 7B
    │   (topology, scripting, automation)
    │
    └── Is it a background prose task? ────────────────────────────────────→ Gemma 3 12B
        (session digest, incident narrative, structured report)

This is a routing layer, not a fallback chain. Each model has a defined lane based on measured performance. The router makes the call based on task type.
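A routing layer of this shape can be sketched as a static lookup keyed on task type. This is an illustrative sketch, not the lab's actual router. The `TaskType` members, the `Route` dataclass, and the `route` function are hypothetical names, and the model strings assume the usual Ollama tag format. An unknown task type raises an error instead of falling through to another model, which is the difference between a routing layer and a fallback chain.

```python
from dataclasses import dataclass
from enum import Enum


class TaskType(Enum):
    VOICE_RESPONSE = "voice_response"
    STATUS_BRIEF = "status_brief"
    TOPOLOGY_GENERATION = "topology_generation"
    GNMI_SCRIPTING = "gnmi_scripting"
    SESSION_SUMMARY = "session_summary"
    INCIDENT_NARRATIVE = "incident_narrative"


@dataclass(frozen=True)
class Route:
    model: str  # assumed Ollama model tag
    lane: str   # "voice" or "background"


# One entry per row of the routing table; nothing here is chosen at runtime.
ROUTES = {
    TaskType.VOICE_RESPONSE: Route("gemma3:4b", "voice"),
    TaskType.STATUS_BRIEF: Route("gemma3:4b", "voice"),
    TaskType.TOPOLOGY_GENERATION: Route("qwen2.5-coder:7b", "background"),
    TaskType.GNMI_SCRIPTING: Route("qwen2.5-coder:7b", "background"),
    TaskType.SESSION_SUMMARY: Route("gemma3:12b", "background"),
    TaskType.INCIDENT_NARRATIVE: Route("gemma3:12b", "background"),
}


def route(task: str) -> Route:
    """Resolve a task-type string to its model and lane; unknown tasks fail loudly."""
    try:
        return ROUTES[TaskType(task)]
    except ValueError:
        raise ValueError(f"no route defined for task type {task!r}") from None


if __name__ == "__main__":
    for task in ("voice_response", "topology_generation", "incident_narrative"):
        print(task, "->", route(task))
```

Keeping the table as data means a future benchmark changes one dictionary entry, not the control flow.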

The cost implication: every single task runs on local hardware at zero API cost. The data never leaves the machine.


What the Series Settled and What It Didn't

Settled:

The GPU question. The T600 with ollama-cuda from AUR runs real CUDA inference. Relative to the CPU baseline, 4B models double their throughput and 7B+ models get 1.5× at best. The 4GB VRAM line is the voice loop ceiling.

The specialization question. Qwen 2.5 Coder 7B's code training doesn't come at a prose cost. It tied Mistral Nemo on writing tasks. But it is not the right model for prose when a general-purpose 12B is available. Use the specialized tool for the task it was specialized for.

The 12B question. Gemma 3 12B is the best local prose model on this hardware. Not because 12B is magic, but because this specific model's training produced better narrative reasoning than the alternatives tested.

Not settled:

Whether better prompting changes the local model picture. Gemma 3 12B showed in Part 3 that it has the structural knowledge to pass T3. It produced 13 correct nodes. A schema-explicit prompt might close the gap enough to make it viable for topology generation at reduced throughput. That is a prompt engineering question, not a hardware question.

Whether any local model is voice viable at the next hardware tier. On hardware with 8-12GB VRAM, the throughput picture changes. That is the benchmark to run when the new rig arrives.


Why This Existed

The autonomous system needed a routing decision. Not a feeling about which models were good. Not a manufacturer's benchmark. A series of tasks drawn from the actual workload, run on the actual hardware, scored by someone who knows what a correct BGP triage looks like.

The series exists because "just use the best local model" is not a sentence that means anything. There is no best model. There are models that are right for specific tasks on specific hardware with specific latency constraints. The series produced a routing table, not a winner.

The routing table is now in the code.


The Right Tool ran five parts from May to August 2026. Part 1 established the CPU baseline. Part 2 corrected the test design. Part 3 completed GPU characterization. Part 4 evaluated prose tasks and introduced Mistral Nemo. Part 5 makes the routing call. The next benchmark series runs when the new hardware arrives.