I run local language models on two GPUs — a GTX 1650 for general chat and an RTX 3050 on the LAN for heavier coding tasks. A Python FastAPI router handled dispatching between them. It worked, but it was slow to start, awkward to deploy on Kubernetes, and the codebase was accumulating duct tape. So I rewrote it in Go over a weekend. What followed was 18 phases of debugging, design decisions, and one very persistent tool-calling loop.
This is the story of that rewrite.
The starting point
The Python router already did a lot: keyword heuristics to pick a GPU target, an intent classification model for ambiguous cases, CPU speculative draft planning, auto-continuation for truncated responses, and SSE streaming. The goal wasn't to simplify — it was to make the whole thing deployable as a single stateless binary in Kubernetes without a virtualenv or a pip install in sight.
A single Go binary. No dependency hell. No container that's 800MB of Python packages.
The initial structure mapped cleanly onto Go packages:
- cmd/router/ — entrypoint
- internal/config/ — env config
- internal/router/ — HTTP handlers + routing logic
- internal/intent/ — intent classifier client
- internal/inference/ — streaming proxy
- internal/memory/ — PostgreSQL layer
- internal/events/ — NATS event bus
Phases 1–3: Ports, pods, and prompt pollution
Deployment issues came fast. The first was a port mismatch — the router listened on :8080 but the Kubernetes service had targetPort: 8000. A five-second fix. The more confusing problem was that /healthz kept returning the Python service's response even after deploying the Go pod. The rollout hadn't completed; I was port-forwarding to the wrong pod. Lesson: always wait on kubectl rollout status.
Routing bugs came next. The intent service — a 1.5B parameter model meant to classify messages as "general" or "coding" — was silently ignoring its classification instructions and generating code instead of JSON. The cause was sending the full conversation history to a model that small. It just couldn't hold the instruction in context. The fix matched what the Python original did: classify only the last user message, truncated to 500 characters.
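The trimming step is tiny. A sketch of roughly what it looks like (the type and function names here are illustrative, not the router's actual code):

```go
// ChatMessage is a simplified view of one OpenAI-style chat turn.
type ChatMessage struct{ Role, Content string }

// lastUserMessage returns only the final user turn, truncated so the
// 1.5B intent model never sees more context than it can follow.
func lastUserMessage(messages []ChatMessage, maxLen int) string {
	for i := len(messages) - 1; i >= 0; i-- {
		if messages[i].Role != "user" {
			continue
		}
		content := messages[i].Content
		if len(content) > maxLen {
			content = content[:maxLen] // byte truncation is fine for a classifier input
		}
		return content
	}
	return ""
}
```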
But even after that fix, the intent model was unreliable enough under real load that I disabled it entirely. The heuristic — keyword matching on the prompt — turned out to be faster, deterministic, and frankly more correct for the workloads I throw at it. The intent model stays in the codebase for when a better model comes along.
Phase 4: Giving the router a memory
The most interesting engineering problem in the whole project was conversation tracking. OpenWebUI — the chat frontend — sends no session ID. Every request is stateless from the client's perspective. I needed to reconstruct conversation continuity from the content alone.
The solution: a rolling SHA256 hash of prior user messages. On each turn, the router hashes all previous user messages it has seen, looks that hash up in PostgreSQL, and finds (or creates) the conversation row. After finding it, it immediately writes the next turn's expected hash so the lookup succeeds on the following request.
This broke in several interesting ways:
- SHA256 of an empty string is a constant — every new conversation shared the same DB row until I added a guard for the empty-hash case.
- Including assistant responses in the hash meant the hash changed unpredictably. Fix: hash only user messages.
- Index-based exclusion (i < len(messages)-1) broke when message ordering varied. Fix: exclude by content match instead.
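With those fixes folded in, the key computation stays small. A sketch, assuming a simplified message type; the helper names are mine, not the router's:

```go
import (
	"crypto/sha256"
	"encoding/hex"
)

// ChatMessage is the same simplified shape as in the earlier sketch.
type ChatMessage struct{ Role, Content string }

// conversationKey hashes only prior user messages, excluding the current
// one by content match rather than index. The bool is false when there is
// nothing to hash, so a brand-new conversation never collides on the
// constant hash of the empty string.
func conversationKey(messages []ChatMessage, current string) (string, bool) {
	h := sha256.New()
	wrote := false
	for _, m := range messages {
		if m.Role != "user" || m.Content == current {
			continue
		}
		h.Write([]byte(m.Content))
		wrote = true
	}
	if !wrote {
		return "", false
	}
	return hex.EncodeToString(h.Sum(nil)), true
}

// nextConversationKey is what gets written back to the row immediately
// after lookup, so the following request's hash matches.
func nextConversationKey(messages []ChatMessage, current string) string {
	h := sha256.New()
	for _, m := range messages {
		if m.Role == "user" && m.Content != current {
			h.Write([]byte(m.Content))
		}
	}
	h.Write([]byte(current))
	return hex.EncodeToString(h.Sum(nil))
}
```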
The schema ended up as three tables: conversations, messages, and embeddings. The embeddings table uses pgvector's vector(768) type for semantic retrieval.
Phases 5–6: Async events and semantic memory
Every stored message publishes a NATS event. Workers downstream handle the slow stuff asynchronously: a summarisation worker periodically condenses long conversations using one of the inference backends; an embedding worker sends message content to a nomic-embed-text service running on a GPU slice and stores the resulting 768-dimensional vector.
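The publish side is a few lines with nats.go. This sketch assumes a JetStream context is already set up; the subject name is my own invention:

```go
import (
	"encoding/json"
	"log"

	"github.com/nats-io/nats.go"
)

// publishStored fires an event after a message row is committed; the
// summarisation and embedding workers subscribe to it and do the slow
// work off the request path.
func publishStored(js nats.JetStreamContext, conversationID, messageID string) {
	payload, err := json.Marshal(map[string]string{
		"conversation_id": conversationID,
		"message_id":      messageID,
	})
	if err != nil {
		log.Printf("marshal event: %v", err)
		return
	}
	if _, err := js.Publish("memory.message.stored", payload); err != nil {
		log.Printf("publish event: %v", err)
	}
}
```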
Context injection became a hybrid of two retrieval strategies. The last 4 messages by chronological order provide conversational flow; the top 4 by pgvector cosine similarity recover relevant older context. The semantic pass excludes anything already in the chronological window to avoid duplication.
One subtle ordering bug: I was fetching history after storing the current user message, so the current message appeared in the retrieved history and got filtered out, leaving the context empty. Flipping the fetch to happen before the store fixed it.
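Roughly what the retrieval looks like, assuming the table and column names implied by the schema above (the real queries and ranking details may differ):

```go
import (
	"context"
	"database/sql"
)

// fetchContext builds the hybrid window: the last n messages for flow plus
// up to n semantically similar older ones, skipping anything already in the
// chronological slice. It runs BEFORE the current user message is stored,
// so it never has to filter itself back out. queryVec is the pgvector text
// form of the query embedding, e.g. "[0.01,0.02,...]".
func fetchContext(ctx context.Context, db *sql.DB, convID, queryVec string, n int) ([]string, error) {
	seen := map[string]bool{}
	var out []string

	recent, err := db.QueryContext(ctx,
		`SELECT content FROM messages
		 WHERE conversation_id = $1
		 ORDER BY created_at DESC LIMIT $2`, convID, n)
	if err != nil {
		return nil, err
	}
	defer recent.Close()
	for recent.Next() {
		var c string
		if err := recent.Scan(&c); err != nil {
			return nil, err
		}
		out = append(out, c)
		seen[c] = true
	}

	// Semantic pass: pgvector cosine distance over stored embeddings,
	// excluding rows already captured by the chronological window.
	sem, err := db.QueryContext(ctx,
		`SELECT m.content FROM messages m
		 JOIN embeddings e ON e.message_id = m.id
		 WHERE m.conversation_id = $1
		 ORDER BY e.embedding <=> $2::vector LIMIT $3`, convID, queryVec, 2*n)
	if err != nil {
		return nil, err
	}
	defer sem.Close()
	for sem.Next() && len(out) < 2*n {
		var c string
		if err := sem.Scan(&c); err != nil {
			return nil, err
		}
		if !seen[c] {
			out = append(out, c)
			seen[c] = true
		}
	}
	return out, sem.Err()
}
```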
Phase 7: Taming OpenWebUI's background requests
OpenWebUI sends three silent background requests after every user message: one for follow-up question suggestions, one for a conversation title, one for tags. Each contains the full conversation history. Because my heuristic router looks for technical keywords, these requests were being sent to the 3050 (the coding backend), locking subsequent routing decisions onto it.
The fix was an isSystemRequest() check that detects the ### Task: prefix these requests share, routes them to the 1650, and skips DB storage entirely. The tricky part was ensuring this check runs at the very top of the handler, before any database operations — an early version stored the system messages first and checked second, polluting the conversation history.
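The check itself is tiny. A sketch, assuming the prefix shows up in the last message of the request body (exactly where OpenWebUI puts it is the part worth verifying):

```go
import "strings"

// ChatMessage mirrors the simplified shape used in the earlier sketches.
type ChatMessage struct{ Role, Content string }

// isSystemRequest spots OpenWebUI's silent housekeeping calls (follow-up
// suggestions, title, tags). They share a "### Task:" prefix, so the check
// is cheap; it has to run before any DB writes so these turns never reach
// conversation memory.
func isSystemRequest(messages []ChatMessage) bool {
	if len(messages) == 0 {
		return false
	}
	last := messages[len(messages)-1]
	return strings.HasPrefix(strings.TrimSpace(last.Content), "### Task:")
}
```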
Phases 8–10: Load balancing, IDE integration, MCP
Health-check-based load balancing was straightforward: poll /health on each backend before routing, fall back to the other if the preferred one is down or has no slots. Since the 3050 is a physical node on my LAN, it can go offline at any time. The fallback to the 1650 handles this gracefully.
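A minimal sketch of the preference-with-fallback rule (the real version also inspects slot availability from the /health response, which this sketch omits):

```go
import (
	"net/http"
	"time"
)

// healthy does a quick poll of a backend's /health endpoint. A short
// timeout matters here: the 3050 is a LAN box that can simply vanish.
func healthy(baseURL string) bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(baseURL + "/health")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// pickBackend prefers the target chosen by the heuristic and falls back
// to the other GPU when the preferred one is unreachable.
func pickBackend(preferred, fallback string) string {
	if healthy(preferred) {
		return preferred
	}
	return fallback
}
```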
Integrating Continue.dev as an IDE assistant exposed a new class of problems. Continue sends tools arrays in every agent request — structured tool definitions that the router was silently stripping. The fix required extending the request struct to capture tool definitions, converting them to plain-text system messages the models can follow, parsing {"tool": "...", "args": {...}} patterns in responses, and returning proper OpenAI tool-call SSE so Continue could execute them.
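The response-side half, spotting the ad-hoc JSON and re-emitting it as an OpenAI-style chunk, looks roughly like this. It is a simplification: the chunk fields follow the OpenAI chat.completion.chunk shape, everything else (regex, type names) is illustrative:

```go
import (
	"encoding/json"
	"fmt"
	"regexp"
)

// modelToolCall is the ad-hoc pattern the models emit when prompted with
// plain-text tool descriptions.
type modelToolCall struct {
	Tool string          `json:"tool"`
	Args json.RawMessage `json:"args"`
}

var toolCallRe = regexp.MustCompile(`\{\s*"tool"\s*:[\s\S]*\}`)

// toOpenAIToolCallChunk extracts the model's JSON blob from the buffered
// response and wraps it in the SSE chunk Continue expects, with the call
// arguments passed through as a JSON string.
func toOpenAIToolCallChunk(raw string) (string, bool) {
	match := toolCallRe.FindString(raw)
	if match == "" {
		return "", false
	}
	var tc modelToolCall
	if err := json.Unmarshal([]byte(match), &tc); err != nil {
		return "", false
	}
	chunk := map[string]any{
		"object": "chat.completion.chunk",
		"choices": []map[string]any{{
			"index": 0,
			"delta": map[string]any{
				"tool_calls": []map[string]any{{
					"index": 0,
					"id":    "call_0",
					"type":  "function",
					"function": map[string]any{
						"name":      tc.Tool,
						"arguments": string(tc.Args),
					},
				}},
			},
			"finish_reason": nil,
		}},
	}
	b, _ := json.Marshal(chunk)
	return fmt.Sprintf("data: %s\n\n", b), true
}
```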
Codebase indexing required adding a POST /v1/embeddings endpoint that proxies to the nomic-embed-text service in OpenAI format. Continue's bundled all-MiniLM-L6-v2 model failed because the WASM runtime couldn't initialise; removing it and pointing to the router's endpoint resolved the issue.
Phases 11–18: The tool-calling rabbit hole
The agent mode work turned into its own multi-phase project. The model (Qwen2.5-Coder-7B) would correctly call ls, receive a file listing, then call ls again. And again. Indefinitely.
The root cause wasn't the model being bad at tool use — it was that Qwen2.5-Coder-7B doesn't natively reason about tool results. It can emit tool call JSON, but it can't interpret the results and decide what to do next. Several approaches failed before landing on stateless in-request loop detection:
detectToolLoop(messages, repeatThreshold=2, maxTurns=30) applies two guards:
- Guard 1: the same tool+args called N times consecutively
- Guard 2: total tool turns ≥ the maxTurns hard cap
- On trigger: extract the unique tool results, build a plain prompt, call StreamAgent(..., jsonSchema=false), and return text
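In Go the two guards reduce to a short, stateless pass over the tool turns extracted from the request body. This sketch uses simplified types, not the router's exact signature:

```go
// ToolTurn is a simplified view of one assistant tool call pulled out of
// the conversation body.
type ToolTurn struct {
	Name string
	Args string // canonicalised JSON args
}

// detectToolLoop returns true when the agent is stuck: either the same
// call repeats back-to-back repeatThreshold times, or the total number of
// tool turns hits the hard cap.
func detectToolLoop(turns []ToolTurn, repeatThreshold, maxTurns int) bool {
	if len(turns) >= maxTurns {
		return true // Guard 2: hard cap on total tool turns
	}
	if len(turns) < repeatThreshold {
		return false
	}
	// Guard 1: identical tool+args repeated consecutively.
	last := turns[len(turns)-1]
	count := 1
	for i := len(turns) - 2; i >= 0; i-- {
		if turns[i].Name != last.Name || turns[i].Args != last.Args {
			break
		}
		count++
	}
	return count >= repeatThreshold
}
```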
An MCP race condition added further complexity: Continue fires the next request before the MCP server returns the tool result, so the same call+result pair appears 10–25 times in body.Messages. A deduplicateToolMessages() pass collapses consecutive identical assistant→tool pairs before building the prompt.
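A sketch of that pass, again over simplified message types:

```go
// ChatMessage is the same simplified shape used in the earlier sketches.
type ChatMessage struct{ Role, Content string }

// deduplicateToolMessages collapses consecutive identical assistant→tool
// pairs left behind by the MCP race, so the rebuilt prompt doesn't repeat
// the same call+result dozens of times.
func deduplicateToolMessages(messages []ChatMessage) []ChatMessage {
	out := make([]ChatMessage, 0, len(messages))
	for i := 0; i < len(messages); i++ {
		// Skip an assistant+tool pair that exactly duplicates the pair we
		// just kept.
		if i+1 < len(messages) &&
			messages[i].Role == "assistant" && messages[i+1].Role == "tool" &&
			len(out) >= 2 &&
			out[len(out)-2] == messages[i] && out[len(out)-1] == messages[i+1] {
			i++ // also skip the tool half of the duplicate pair
			continue
		}
		out = append(out, messages[i])
	}
	return out
}
```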
Other improvements in this phase: buffered agent streaming (buffer the entire response, inspect for tool calls, send either structured SSE or plain text — never both); a next-file hint injection that tells the model which file to read next after a glob search; grammar_json_schema constraint via llama.cpp to force valid JSON output; and storing completed read_file results in PostgreSQL so follow-up questions about specific files don't lose their context.
Infrastructure, end state
| Service | Model | Hardware | Role |
|---|---|---|---|
| llama-inference | Qwen2.5-3B-Q4 | GTX 1650 | General chat |
| llama-embedding | nomic-embed-text-v1.5 | GTX 1650 ×0.5 | Embeddings |
| RTX 3050 node | Qwen2.5-Coder-7B-Q5 | RTX 3050 (LAN) | Coding / agent |
| llm-router | Go binary | Kubernetes | Control plane |
| PostgreSQL + pgvector | pg16 | Kubernetes | Memory |
| NATS JetStream | — | Kubernetes | Event bus |
Working
- Chat routing (heuristic)
- Conversation memory (rolling hash)
- Hybrid semantic + chronological context injection
- Agent mode always routed to 3050
- Tool calls initiated and executed with loop detection
- Codebase indexing via /v1/embeddings
- File contents persisted to memory after agent sessions
- System request detection and lightweight routing
Still flaky / not working
- Model occasionally loops before the next-file hint fires
- Glob search can exhaust max turns before reading any files
- Auto-continuation disabled in agent mode
- Intent model disabled — too unreliable with current models
- Draft model disabled — leaks planning steps into responses
What I'd do differently
The hash-based conversation tracking is the thing I'm least happy with. It's clever but fragile — there's a race condition on turn 2 and it breaks down when conversation order is inconsistent. A proper session ID passed by the client would eliminate all of it. If you control the frontend, use a session ID.
The tool-calling architecture is fundamentally limited by the model. Qwen2.5-Coder-7B is impressive for its size, but it wasn't trained to reason about tool results in a multi-turn loop. The right fix is a model with native function calling — not more prompt engineering. Every hack I added (next-file hints, JSON schema constraints, loop detection) is load-bearing scaffolding around a gap in the model's capabilities.
Go was the right choice. The router is a single binary, deploys in seconds, handles concurrent SSE streams cleanly, and the type system caught several category errors that would have been silent bugs in Python.
What's next
Auto-enrollment (backends register themselves, zero config changes for new nodes), request queueing via NATS when all slots are full, a local agent sidecar on the workstation with real filesystem access, and — when a better model becomes available — re-enabling the intent classifier and the draft planner.
The backlog is longer than when I started. That feels about right.