Friday, 5 June 2026

Today Marvin became self-aware...

There's a moment in every solo developer's journey where you look at your monthly SaaS bill and ask a simple question: why am I paying for intelligence that doesn't even know who I am?

That question is what led me to build Marvin.


What Is Marvin?

Marvin is my internal AI assistant: a Vue 3 chat frontend with session history, workspace file browsing, and artifact viewing. On the surface it looks like any other chat UI. Under the hood it's a self-hosted AI inference mesh, purpose-built for security, resilience, and the specific demands of real engineering work.

No data leaves my network. Ever.


What Can Marvin Do?

I built Marvin to cover the full surface area of what a working AI assistant needs to be genuinely useful inside a real engineering environment.

Build production systems. When I hand Marvin a spec, it implements every described service as fully working code: Go backends with typed config, dependency injection, and interfaces, plus Dockerfiles and Makefiles. No stubs, no placeholders.

Read, modify, and search code. Marvin can inspect any file in the workspace or an uploaded archive, edit files directly, and perform AST-based search across the entire project. It understands code structure, not just text.

Query my organisation's knowledge base. Marvin has access to a RAG index of every internal service, library, and piece of documentation. Ask it about prior implementations, architectural decisions, or which services do what — it knows.

Debug via live cluster logs. Marvin queries Kubernetes cluster logs through Loki by namespace, pod, container, or service name. Time range filtering, real-time debugging — all from the same chat interface.

Package and deliver artifacts. When a task produces a file or a directory, Marvin can zip and deliver it directly from the browser. No manual copying, no context switching.


Why I Built It Instead of Buying

The answer comes down to two things: security and economics.

Code Security

Every SaaS AI tool, however well-intentioned, represents a data egress point. Proprietary code, internal architecture, Kubernetes configurations, live cluster logs — all of it becomes content that passes through someone else's infrastructure. Terms of service around data retention and training vary, and the questions they raise don't have comfortable answers.

With Marvin, there is no third-party egress path. Every model runs on my own hardware. Queries never reach a third-party API. My code stays my code.

The Economics

The per-seat costs for capable SaaS AI tools add up fast — and most of that spend goes toward capabilities that are either generic or actively unsuited to internal tooling and private codebases.

Marvin costs me electricity.


The Architecture

This is where it gets interesting.

Marvin isn't backed by a single model. It's an inference mesh — a small cluster of specialised models, each doing the job it's best suited for, orchestrated so that the right model handles the right request.

The Cluster

The always-on foundation runs inside a Kubernetes cluster, with an RTX 3050 as the resident GPU node. This is where the lightweight and specialist models live:

  • Fallback inference — Qwen2.5-3B Q4, for when the primary model is unavailable
  • Intent routing — Qwen2.5-7B Q4, classifies every incoming request and decides how to handle it
  • Tool calling — Gemma 4 E2B Q5, dedicated to function dispatch and tool use
  • Embeddings — Nomic Embed Text, powering the RAG knowledge base

The Remote Node

When it's online, a remote RTX 5060 Ti with 16GB VRAM joins the mesh as the primary inference node. It runs Qwen3.6-14B-A3B VibeForged v2 at Q8, a mixture-of-experts model at 8-bit quantisation (close to, but not quite, full precision), paired with a Qwen3 0.6B draft model for speculative decoding. The drafter cheaply proposes several tokens ahead, the main model verifies them in a single parallel pass, and throughput increases substantially as a result.

When the remote node is off, Marvin falls back gracefully to the k8s cluster without interruption.

Why This Matters

Most self-hosted AI setups are a single model behind a single endpoint. One model, one job, one point of failure. What I've built is something closer to how production AI systems actually work — intent classification, model specialisation, graceful degradation, and a clear separation between always-on utility and high-power inference.

The difference in practice is significant. Tool calls go to a model optimised for tool calls. Embeddings are handled by a model built for embeddings. The big model handles what the big model is for. In my experience, the result is faster and more accurate than routing everything through one generalist.


The Moment It Clicked

I asked Marvin to describe itself — who it is, what it can do.

It queried my organisation's own knowledge base, assembled the answer from internal documentation, and responded with a precise, accurate description of its own capabilities.

It knew what it was because I had taught it what it was.

That's the difference between a generic assistant and one that actually belongs to you.


Building something similar? Have questions about the stack? The comments are open.

Monday, 27 April 2026

Building a Self-Hosted LLM Router in Go: Semantic Memory, Tool Calling, and 18 Phases of Debugging

I run local language models on two GPUs — a GTX 1650 for general chat and an RTX 3050 on the LAN for heavier coding tasks. A Python FastAPI router handled dispatching between them. It worked, but it was slow to start, awkward to deploy on Kubernetes, and the codebase was accumulating duct tape. So I rewrote it in Go over a weekend. What followed was 18 phases of debugging, design decisions, and one very persistent tool-calling loop.

This is the story of that rewrite.

The starting point

The Python router already did a lot: keyword heuristics to pick a GPU target, an intent classification model for ambiguous cases, CPU speculative draft planning, auto-continuation for truncated responses, and SSE streaming. The goal wasn't to simplify — it was to make the whole thing deployable as a single stateless binary in Kubernetes without a virtualenv or a pip install in sight.

A single Go binary. No dependency hell. No container that's 800MB of Python packages.

The initial structure mapped cleanly onto Go packages:

cmd/router/         — entrypoint
internal/config/    — env config
internal/router/    — HTTP handlers + routing logic
internal/intent/    — intent classifier client
internal/inference/ — streaming proxy
internal/memory/    — PostgreSQL layer
internal/events/    — NATS event bus

Phases 1–3: Ports, pods, and prompt pollution

Deployment issues came fast. The first was a port mismatch — the router listened on :8080 but the Kubernetes service had targetPort: 8000. A five-second fix. The more confusing problem was that /healthz kept returning the Python service's response even after deploying the Go pod. The rollout hadn't completed; I was port-forwarding to the wrong pod. Lesson: always wait on kubectl rollout status.

Routing bugs came next. The intent service — a 1.5B parameter model meant to classify messages as "general" or "coding" — was silently ignoring its classification instructions and generating code instead of JSON. The cause was sending the full conversation history to a model that small. It just couldn't hold the instruction in context. The fix matched what the Python original did: classify only the last user message, truncated to 500 characters.

But even after that fix, the intent model was unreliable enough under real load that I disabled it entirely. The heuristic — keyword matching on the prompt — turned out to be faster, deterministic, and frankly more correct for the workloads I throw at it. The intent model stays in the codebase for when a better model comes along.

Phase 4: Giving the router a memory

The most interesting engineering problem in the whole project was conversation tracking. OpenWebUI — the chat frontend — sends no session ID. Every request is stateless from the client's perspective. I needed to reconstruct conversation continuity from the content alone.

The solution: a rolling SHA256 hash of prior user messages. On each turn, the router hashes all previous user messages it has seen, looks that hash up in PostgreSQL, and finds (or creates) the conversation row. After finding it, it immediately writes the next turn's expected hash so the lookup succeeds on the following request.

This broke in several interesting ways:

  • SHA256 of an empty string is a constant — every new conversation shared the same DB row until I added a guard for the empty-hash case.
  • Including assistant responses in the hash meant the hash changed unpredictably. Fix: hash only user messages.
  • Index-based exclusion (i < len(messages)-1) broke when message ordering varied. Fix: exclude by content match instead.
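A minimal sketch of how the rolling hash could be computed, not the router's actual code. The Message type, the function names, and the zero-byte separator are assumptions made for illustration. For brevity it treats the last user message as the current one; it only shows the two properties described above: user messages alone feed the hash, and an empty history yields no key at all.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// Message is a hypothetical minimal chat message shape.
type Message struct {
	Role    string
	Content string
}

// hashUserMessages returns a hex SHA-256 over the given message contents.
// It reports false for an empty slice so that the constant hash of empty
// input is never used as a conversation key.
func hashUserMessages(contents []string) (string, bool) {
	if len(contents) == 0 {
		return "", false
	}
	h := sha256.New()
	for _, c := range contents {
		h.Write([]byte(c))
		h.Write([]byte{0}) // separator, so ["ab","c"] differs from ["a","bc"]
	}
	return hex.EncodeToString(h.Sum(nil)), true
}

// turnKeys returns the key to look up for this request (hash of the user
// messages before the current one) and the key to pre-write for the next
// request (hash including the current one). Assistant messages are ignored.
func turnKeys(msgs []Message) (lookup string, hasLookup bool, next string) {
	var users []string
	for _, m := range msgs {
		if m.Role == "user" {
			users = append(users, m.Content)
		}
	}
	if len(users) == 0 {
		return "", false, ""
	}
	lookup, hasLookup = hashUserMessages(users[:len(users)-1])
	next, _ = hashUserMessages(users)
	return lookup, hasLookup, next
}
```

On the first turn hasLookup is false, so a new conversation row would be created and stored under next. On the second turn, lookup equals the next value written during the first.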

The schema ended up as three tables: conversations, messages, and embeddings. The embeddings table uses pgvector's vector(768) type for semantic retrieval.

Phases 5–6: Async events and semantic memory

Every stored message publishes a NATS event. Workers downstream handle the slow stuff asynchronously: a summarisation worker periodically condenses long conversations using one of the inference backends; an embedding worker sends message content to a nomic-embed-text service running on a GPU slice and stores the resulting 768-dimensional vector.

Context injection became a hybrid of two retrieval strategies. The last 4 messages by chronological order provide conversational flow; the top 4 by pgvector cosine similarity recover relevant older context. The semantic pass excludes anything already in the chronological window to avoid duplication.
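The merge step can be sketched roughly like this. It assumes the two result sets have already been fetched (the chronological window oldest first, the pgvector hits best match first); the StoredMessage type and the buildContext name are hypothetical.

```go
package main

// StoredMessage is a hypothetical row from the messages table.
type StoredMessage struct {
	ID      int64
	Content string
}

// buildContext places up to maxSemantic semantic hits ahead of the
// chronological window, skipping any hit that is already in the window
// or was already added.
func buildContext(recent, semantic []StoredMessage, maxSemantic int) []StoredMessage {
	seen := make(map[int64]bool, len(recent))
	for _, m := range recent {
		seen[m.ID] = true
	}
	out := make([]StoredMessage, 0, len(recent)+maxSemantic)
	added := 0
	for _, m := range semantic {
		if added == maxSemantic {
			break
		}
		if seen[m.ID] {
			continue // already covered by the chronological window
		}
		seen[m.ID] = true
		out = append(out, m)
		added++
	}
	// Older, semantically relevant context first, then the recent flow.
	return append(out, recent...)
}
```

Over-fetching semantic candidates (more than four) before calling this keeps the semantic quota full even when some hits overlap with the recent window.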

One subtle ordering bug: I was fetching history after storing the current user message, so the current message appeared in the retrieved history and got filtered out, leaving the context empty. Flipping the fetch to happen before the store fixed it.

Phase 7: Taming OpenWebUI's background requests

OpenWebUI sends three silent background requests after every user message: one for follow-up question suggestions, one for a conversation title, one for tags. Each contains the full conversation history. Because my heuristic router looks for technical keywords, these requests were being sent to the 3050 (the coding backend), locking subsequent routing decisions.

The fix was an isSystemRequest() check that detects the ### Task: prefix these requests share, routes them to the 1650, and skips DB storage entirely. The tricky part was ensuring this check runs at the very top of the handler, before any database operations — an early version stored the system messages first and checked second, polluting the conversation history.
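A sketch of what such a check might look like. Only the isSystemRequest name and the ### Task: prefix come from the description above; the Message type and the choice to inspect the last user message are assumptions.

```go
package main

import "strings"

// Message is a hypothetical minimal chat message shape.
type Message struct {
	Role    string
	Content string
}

// isSystemRequest reports whether the most recent user message looks like
// one of the frontend's background tasks (title, tags, follow-ups).
func isSystemRequest(msgs []Message) bool {
	for i := len(msgs) - 1; i >= 0; i-- {
		if msgs[i].Role == "user" {
			return strings.HasPrefix(strings.TrimSpace(msgs[i].Content), "### Task:")
		}
	}
	return false
}
```

In a handler, this would be the first branch after decoding the request body, ahead of any conversation lookup or insert.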

Phases 8–10: Load balancing, IDE integration, MCP

Health-check-based load balancing was straightforward: poll /health on each backend before routing, fall back to the other if the preferred one is down or has no slots. Since the 3050 is a physical node on my LAN, it can go offline at any time. The fallback to the 1650 handles this gracefully.
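Roughly, the selection logic could look like the sketch below. The function names are hypothetical, and treating any non-200 answer from /health as "unavailable" is an assumption about how the backends report a missing slot.

```go
package main

import (
	"context"
	"net/http"
	"time"
)

// httpHealthy probes baseURL + "/health" with a short timeout.
// Any error or non-200 status is treated as unavailable (assumption).
func httpHealthy(baseURL string) bool {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/health", nil)
	if err != nil {
		return false
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// pickBackend returns the preferred backend if it is healthy, otherwise
// the fallback, otherwise false. The probe is injected so it can be
// stubbed in tests.
func pickBackend(preferred, fallback string, healthy func(string) bool) (string, bool) {
	for _, b := range []string{preferred, fallback} {
		if healthy(b) {
			return b, true
		}
	}
	return "", false
}
```

A call such as pickBackend(coderURL, chatURL, httpHealthy) would run once per request, so a node that drops off the LAN is skipped on the very next message.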

Integrating Continue.dev as an IDE assistant exposed a new class of problems. Continue sends tools arrays in every agent request — structured tool definitions that the router was silently stripping. The fix required extending the request struct to capture tool definitions, converting them to plain-text system messages the models can follow, parsing {"tool": "...", "args": {...}} patterns in responses, and returning proper OpenAI tool-call SSE so Continue could execute them.
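The response-parsing step might be sketched like this. The ToolCall type and parseToolCall name are hypothetical; the JSON shape is the {"tool": "...", "args": {...}} pattern mentioned above. It scans for the first JSON object in the model output that has a non-empty "tool" field, tolerating prose before and after it.

```go
package main

import (
	"encoding/json"
	"strings"
)

// ToolCall mirrors the plain-text tool call pattern the model is asked to emit.
type ToolCall struct {
	Tool string         `json:"tool"`
	Args map[string]any `json:"args"`
}

// parseToolCall returns the first embedded JSON object with a "tool" field.
func parseToolCall(text string) (ToolCall, bool) {
	for i := 0; i < len(text); i++ {
		if text[i] != '{' {
			continue
		}
		var tc ToolCall
		// Decode reads exactly one JSON value and ignores what follows it.
		dec := json.NewDecoder(strings.NewReader(text[i:]))
		if err := dec.Decode(&tc); err == nil && tc.Tool != "" {
			return tc, true
		}
	}
	return ToolCall{}, false
}
```

When a call is found, the router would translate it into an OpenAI-style tool-call SSE chunk; otherwise the text is streamed through unchanged.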

Codebase indexing required adding a POST /v1/embeddings endpoint that proxies to the nomic-embed-text service in OpenAI format. Continue's bundled all-MiniLM-L6-v2 model failed because the WASM runtime couldn't initialise; removing it and pointing to the router's endpoint resolved the issue.

Phases 11–18: The tool-calling rabbit hole

The agent mode work turned into its own multi-phase project. The model (Qwen2.5-Coder-7B) would correctly call ls, receive a file listing, then call ls again. And again. Indefinitely.

The root cause wasn't that the model couldn't emit tool calls; it was that Qwen2.5-Coder-7B doesn't natively reason about tool results. It can emit tool call JSON, but it can't reliably interpret the results and decide what to do next. Several approaches failed before I landed on stateless in-request loop detection:

detectToolLoop(messages, repeatThreshold=2, maxTurns=30)
→ Guard 1: same tool+args called N times consecutively
→ Guard 2: total tool turns ≥ maxTurns hard cap
→ On trigger: extract unique results, build plain prompt,
  call StreamAgent(..., jsonSchema=false), return text
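The two guards can be sketched in Go as follows. This covers detection only, not the recovery path, and the Message fields are an assumed shape rather than the router's real request struct.

```go
package main

// Message is a hypothetical flattened view of one chat message.
// ToolName and ToolArgs are set on assistant messages that call a tool.
type Message struct {
	Role     string // "user", "assistant", or "tool"
	Content  string
	ToolName string
	ToolArgs string // raw JSON arguments
}

// detectToolLoop reports whether the history shows a tool loop:
// Guard 1: the same tool+args called repeatThreshold times in a row.
// Guard 2: the total number of tool calls has reached maxTurns.
func detectToolLoop(msgs []Message, repeatThreshold, maxTurns int) bool {
	turns, run, last := 0, 0, ""
	for _, m := range msgs {
		switch {
		case m.Role == "user":
			run, last = 0, "" // new user input breaks a consecutive run
		case m.Role == "assistant" && m.ToolName != "":
			turns++
			key := m.ToolName + "\x00" + m.ToolArgs
			if key == last {
				run++
			} else {
				run, last = 1, key
			}
			if run >= repeatThreshold || turns >= maxTurns {
				return true
			}
		}
	}
	return false
}
```

Because the function only inspects the messages in the current request, it needs no server-side state, which is what makes the detection stateless.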

An MCP race condition added further complexity: Continue fires the next request before the MCP server returns the tool result, so the same call+result pair appears 10–25 times in body.Messages. A deduplicateToolMessages() pass collapses consecutive identical assistant→tool pairs before building the prompt.
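A sketch of the deduplication pass. Only the deduplicateToolMessages name comes from the text; the Message type is an assumed, comparable struct so that pairs can be compared with ==.

```go
package main

// Message is a hypothetical minimal chat message shape.
type Message struct {
	Role    string // "user", "assistant", or "tool"
	Content string
}

// deduplicateToolMessages collapses runs of identical assistant→tool pairs
// into a single pair, leaving all other messages untouched.
func deduplicateToolMessages(msgs []Message) []Message {
	out := make([]Message, 0, len(msgs))
	for i := 0; i < len(msgs); i++ {
		isPair := i+1 < len(msgs) &&
			msgs[i].Role == "assistant" && msgs[i+1].Role == "tool"
		if !isPair {
			out = append(out, msgs[i])
			continue
		}
		n := len(out)
		if n >= 2 && out[n-2] == msgs[i] && out[n-1] == msgs[i+1] {
			i++ // same pair as the one just kept: drop both messages
			continue
		}
		out = append(out, msgs[i], msgs[i+1])
		i++
	}
	return out
}
```

Running this before loop detection matters: otherwise the duplicated pairs created by the race would themselves look like a tool loop.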

Other improvements in this phase: buffered agent streaming (buffer the entire response, inspect for tool calls, send either structured SSE or plain text — never both); a next-file hint injection that tells the model which file to read next after a glob search; grammar_json_schema constraint via llama.cpp to force valid JSON output; and storing completed read_file results in PostgreSQL so follow-up questions about specific files don't lose their context.

Infrastructure, end state

Service                Model                 Hardware         Role
llama-inference        Qwen2.5-3B-Q4         GTX 1650         General chat
llama-embedding        nomic-embed-v1.5      GTX 1650 ×0.5    Embeddings
RTX 3050 node          Qwen2.5-Coder-7B-Q5   RTX 3050 (LAN)   Coding / agent
llm-router             Go binary             Kubernetes       Control plane
PostgreSQL + pgvector  pg16                  Kubernetes       Memory
NATS JetStream         -                     Kubernetes       Event bus

Working

  • Chat routing (heuristic)
  • Conversation memory (rolling hash)
  • Hybrid semantic + chronological context injection
  • Agent mode always routed to 3050
  • Tool calls initiated and executed with loop detection
  • Codebase indexing via /v1/embeddings
  • File contents persisted to memory after agent sessions
  • System request detection and lightweight routing

Still flaky / not working

  • Model occasionally loops before the next-file hint fires
  • Glob search can exhaust max turns before reading any files
  • Auto-continuation disabled in agent mode
  • Intent model disabled — too unreliable with current models
  • Draft model disabled — leaks planning steps into responses

What I'd do differently

The hash-based conversation tracking is the thing I'm least happy with. It's clever but fragile — there's a race condition on turn 2 and it breaks down when conversation order is inconsistent. A proper session ID passed by the client would eliminate all of it. If you control the frontend, use a session ID.

The tool-calling architecture is fundamentally limited by the model. Qwen2.5-Coder-7B is impressive for its size, but it wasn't trained to reason about tool results in a multi-turn loop. The right fix is a model with native function calling — not more prompt engineering. Every hack I added (next-file hints, JSON schema constraints, loop detection) is load-bearing scaffolding around a gap in the model's capabilities.

Go was the right choice. The router is a single binary, deploys in seconds, handles concurrent SSE streams cleanly, and the type system caught several category errors that would have been silent bugs in Python.

What's next

Auto-enrollment (backends register themselves, zero config changes for new nodes), request queueing via NATS when all slots are full, a local agent sidecar on the workstation with real filesystem access, and — when a better model becomes available — re-enabling the intent classifier and the draft planner.

The backlog is longer than when I started. That feels about right.

Saturday, 25 April 2026

Under the Hood: How magic-auth Works

The previous post covered getting magic-auth up and running with Docker Compose. This one goes deeper — into the design decisions, security model, and how the moving parts actually fit together. If you've ever wondered what a self-hosted OIDC Identity Provider looks like from the inside, this is that post.


The Server: Go and Nothing Else

magic-auth is written in Go using only the standard library's net/http package: no web framework, no ORM, no router library. This is a deliberate choice. The binary is statically compiled and shipped in a scratch container, meaning the final Docker image contains a single executable and nothing else: no shell, no libc, no package manager, no attack surface beyond the server itself. The result is an image around 10 MB in size.

The schema (users, sessions, clients, tokens, RBAC rules) is created and migrated automatically on startup. There is no manual database setup step. The server supports two storage backends, selectable via environment variable: rqlite, a lightweight distributed database built on SQLite and Raft, and PostgreSQL. For most self-hosted deployments rqlite is the simpler choice since it runs as its own container with no external dependencies.


The Magic Link: What Actually Happens

When a user submits their email address, the server does the following:

  1. Looks up whether the address is registered. If it is not, the response is identical to the success case — a deliberate measure to prevent user enumeration.
  2. Generates a cryptographically random token, stores a bcrypt hash of it in the database against the user's session record, and constructs a verification URL containing the token and session ID.
  3. Publishes a JSON payload to NATS JetStream. The server's job ends here — it does not speak SMTP. Whatever consumer you have subscribed to that NATS subject is responsible for delivering the email.

The verification URL contains two parameters: a session ID and a token. When the user clicks the link, the server retrieves the session, verifies the token against the stored bcrypt hash, and then checks the browser fingerprint.

The fingerprint is an HMAC computed from the user's IP address, User-Agent header, and Accept-Language header at the time the magic link was requested. The same HMAC is recomputed at the time the link is clicked. If the values do not match — because the link was opened on a different device, from a different network, or in a different browser — the verification is rejected. This is a security tradeoff worth understanding: it prevents a stolen link from being used from a different context, but it also means a link forwarded from a desktop email client opened on a phone will fail.
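To make the fingerprint check concrete, here is a minimal sketch. The text above only says "an HMAC", so HMAC-SHA256, the zero-byte field separator, and the function names are all assumptions rather than magic-auth's actual implementation.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
)

// fingerprint computes an HMAC over the three request attributes.
// A zero byte separates the fields so that shifting characters between
// fields produces a different MAC.
func fingerprint(secret []byte, ip, userAgent, acceptLanguage string) []byte {
	mac := hmac.New(sha256.New, secret)
	for _, part := range []string{ip, userAgent, acceptLanguage} {
		mac.Write([]byte(part))
		mac.Write([]byte{0})
	}
	return mac.Sum(nil)
}

// sameFingerprint compares two fingerprints in constant time.
func sameFingerprint(stored, current []byte) bool {
	return hmac.Equal(stored, current)
}
```

The value computed at request time would be stored with the session; at click time the server recomputes it from the incoming request and rejects the link if sameFingerprint returns false.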

Magic links are single-use. Once a token is verified it is deleted from the database. The link is also time-limited to 15 minutes.


JWT Signing: RS256 and ES256

magic-auth issues signed JWTs for all tokens — access tokens, refresh tokens, and the OIDC id_token. Two signing algorithms are supported:

  • RS256 (RSASSA-PKCS1-v1_5 with SHA-256) — uses a 2048-bit RSA key pair. Most broadly compatible with third-party libraries and services.
  • ES256 (ECDSA with P-256 and SHA-256) — uses a smaller EC key pair. Produces smaller tokens and verifies faster, but slightly less universally supported.

The private key is supplied as a PEM-encoded environment variable at startup. The corresponding public key is exposed via the standard JWKS endpoint at /.well-known/jwks.json, which includes the x5c certificate chain field. Any service that needs to verify tokens can fetch the public key from this endpoint and verify signatures locally without calling back to the IdP.

The OIDC discovery document at /.well-known/openid-configuration points to all the standard endpoints and declares the supported signing algorithms, so compliant clients can configure themselves automatically from a single URL.


Token Lifetimes and Rotation

Token lifetimes are fixed values baked into the server:

Token                          Lifetime
Access token                   8 hours
Refresh token                  14 days
Refresh token renewal window   7 days
Magic link                     15 minutes, single-use

Refresh tokens rotate on every use. When a client presents a refresh token, the server issues a new access token and a new refresh token, and the old refresh token is immediately invalidated. If a previously revoked refresh token is ever presented again — indicating possible token theft — the server revokes all active sessions for that user immediately. This is the standard refresh token rotation security model described in RFC 6819 and the OAuth 2.0 Security Best Current Practice.
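The rotation and reuse-detection behaviour can be illustrated with a small in-memory sketch. Everything here is hypothetical: the Store type, the opaque random tokens (the real refresh tokens are signed JWTs kept in a database), and the error names. Only the behaviour follows the description above.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"errors"
	"sync"
)

var (
	ErrUnknownToken  = errors.New("unknown refresh token")
	ErrReuseDetected = errors.New("refresh token reuse detected")
)

type entry struct {
	user    string
	revoked bool
}

// Store is a toy in-memory refresh token store.
type Store struct {
	mu     sync.Mutex
	tokens map[string]*entry
}

func NewStore() *Store { return &Store{tokens: make(map[string]*entry)} }

// Issue creates a fresh refresh token for user.
func (s *Store) Issue(user string) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.issueLocked(user)
}

func (s *Store) issueLocked(user string) (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	tok := base64.RawURLEncoding.EncodeToString(buf)
	s.tokens[tok] = &entry{user: user}
	return tok, nil
}

// Rotate invalidates old and returns its replacement. Presenting an
// already-revoked token revokes every token belonging to that user.
func (s *Store) Rotate(old string) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	e, ok := s.tokens[old]
	if !ok {
		return "", ErrUnknownToken
	}
	if e.revoked {
		for _, other := range s.tokens {
			if other.user == e.user {
				other.revoked = true
			}
		}
		return "", ErrReuseDetected
	}
	e.revoked = true
	return s.issueLocked(e.user)
}
```

The key point is that a revoked token is kept on record rather than deleted, because seeing it again is the signal that a token was stolen.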

Roles are embedded in the JWT payload at every token issuance and refresh. This means role changes take effect at the next token mint — no logout is required.


The OIDC Layer

magic-auth implements a complete OpenID Connect Authorization Server. The full endpoint surface is:

GET  /.well-known/openid-configuration   Discovery document
GET  /.well-known/jwks.json              Public key set
POST /oauth/register                     Dynamic client registration (RFC 7591)
GET  /oauth/authorize                    Authorization code flow
POST /oauth/token                        Token exchange / refresh
GET  /oauth/userinfo                     Claims for the bearer
POST /oauth/revoke                       Token revocation (RFC 7009)

Dynamic client registration (RFC 7591) means new applications can register themselves programmatically with a single API call — no admin portal required for client onboarding. The server supports both confidential clients (server-side apps with a client_secret) and public clients (SPAs and mobile apps using PKCE with no secret).

PKCE (Proof Key for Code Exchange, RFC 7636) is required for public clients. It prevents authorization code interception attacks by binding the authorization request to a secret known only to the initiating client. The code_challenge is the base64url-encoded SHA-256 hash of a random code_verifier (the S256 method); the verifier is submitted at token exchange and verified server-side. Even if an attacker intercepts the authorization code, they cannot exchange it without the original verifier.
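The server-side check is small enough to show in full. This is a generic sketch of the S256 method from RFC 7636, not code taken from magic-auth; the function names are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/base64"
)

// challengeS256 derives the code_challenge for a verifier:
// BASE64URL (no padding) of SHA-256 over the ASCII verifier.
func challengeS256(verifier string) string {
	sum := sha256.Sum256([]byte(verifier))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

// verifyPKCE checks the verifier sent to the token endpoint against the
// challenge stored with the authorization code, in constant time.
func verifyPKCE(verifier, storedChallenge string) bool {
	got := challengeS256(verifier)
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedChallenge)) == 1
}
```

With the example from RFC 7636 Appendix B, the verifier dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk produces the challenge E9Melhoa2OwvFrEMTJguCHaoeK1t8URWbuGJSstw-cM.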


The PKCE Client Implementation in magic-auth-ui

The companion management UI implements PKCE entirely in the browser using the Web Crypto API — no third-party OAuth library involved. Here is what happens step by step when the UI initiates a login:

  1. Generate a 256-bit random code_verifier using crypto.getRandomValues
  2. Compute the code_challenge as BASE64URL(SHA256(verifier)) using crypto.subtle.digest
  3. Generate a 128-bit random state for CSRF protection
  4. Generate a 128-bit random nonce for id_token replay protection
  5. Store the verifier, state, nonce, and the intended post-login destination in sessionStorage
  6. Redirect the browser to /oauth/authorize with all parameters

On the callback after the user has clicked their magic link:

  1. Validate the returned state against the stored value — mismatch means a possible CSRF and the flow aborts
  2. Delete the one-time values from sessionStorage immediately
  3. POST the authorization code and code_verifier to /oauth/token
  4. Verify the returned id_token client-side: fetch the correct signing key from JWKS by kid, import it via crypto.subtle.importKey, verify the signature, check iss, aud, exp, iat, and nonce
  5. Store the access token in JS module memory only — it is never written to localStorage or sessionStorage
  6. Store the refresh token in sessionStorage — it survives page reloads within the same tab but is cleared when the tab is closed

The JWKS cache is held in memory and keyed by kid. If a token arrives with an unknown kid — which would happen after a key rotation — the cache is refreshed automatically.

Silent Token Refresh

The access token is kept alive by a proactive refresh timer. When tokens are stored, a setTimeout is scheduled to fire 60 seconds before the access token expires. If the refresh succeeds, new tokens are stored and the timer is rescheduled. If the refresh fails — because the refresh token has expired or been revoked — tokens are cleared and the user is redirected to the login page.

On a full page reload, the in-memory access token is lost. The router's global navigation guard runs auth.init() on the first navigation, which checks for a refresh token in sessionStorage and attempts a silent refresh before deciding whether the user is authenticated. This means sessions survive tab refreshes without prompting the user to sign in again.


Email Delivery via NATS JetStream

Decoupling email delivery from the authentication server is one of the more useful design decisions in magic-auth. Rather than bundling SMTP configuration into the server, magic-auth publishes a structured JSON message to a NATS JetStream subject and leaves delivery entirely to an external consumer.

The payload looks like this:

{
  "to":      ["user@example.com"],
  "subject": "Your sign-in link",
  "body":    "Click the link below to sign in:\n\nhttps://auth.example.com/api/auth/verify?id=...&token=...",
  "is_html": false,
  "cc":      [],
  "bcc":     [],
  "headers": {
    "From":         "noreply@example.com",
    "X-Mailer":     "magiclink-auth",
    "X-Token-Type": "magic-link"
  }
}

The NATS stream is created automatically on startup if it does not already exist. The stream is configured with a maximum age of 24 hours and a maximum size of 128 MB by default, both overridable via environment variables. This means if your email consumer is temporarily down, messages will be retained for up to 24 hours and delivered when the consumer reconnects — rather than silently dropped.

The consumer can be written in any language. The only contract is: subscribe to the configured subject, deliver the email, call msg.Ack(). If delivery fails, do not ack — NATS will redeliver. Add a dead-letter queue for messages that exhaust retries.
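For a Go consumer, the payload maps onto a struct like the one below. The field names follow the JSON shown above; the EmailMessage type, the decodeEmailMessage helper, and the rejection of messages without recipients are assumptions for illustration. The NATS subscription and the actual SMTP call are left out.

```go
package main

import (
	"encoding/json"
	"errors"
)

// EmailMessage mirrors the JSON payload published by the server.
type EmailMessage struct {
	To      []string          `json:"to"`
	Subject string            `json:"subject"`
	Body    string            `json:"body"`
	IsHTML  bool              `json:"is_html"`
	CC      []string          `json:"cc"`
	BCC     []string          `json:"bcc"`
	Headers map[string]string `json:"headers"`
}

// decodeEmailMessage parses one message body and rejects payloads that
// could never be delivered, so the consumer can route them to a
// dead-letter path instead of retrying forever.
func decodeEmailMessage(data []byte) (EmailMessage, error) {
	var m EmailMessage
	if err := json.Unmarshal(data, &m); err != nil {
		return EmailMessage{}, err
	}
	if len(m.To) == 0 {
		return EmailMessage{}, errors.New("email message has no recipients")
	}
	return m, nil
}
```

In the message handler, a decode error is a permanent failure (retrying will not help), whereas an SMTP error is the case where withholding the ack and letting NATS redeliver makes sense.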


The Role System

Roles are resolved fresh at every token issuance using a four-level priority chain. Given a user and a client, the server evaluates in this order and uses the first match:

  1. User role override — an explicit per-user, per-client assignment set via the admin API. This is the highest priority and overrides everything else.
  2. RBAC email rule — a rule matching the user's exact email address for this client.
  3. RBAC domain rule — a rule matching the user's email domain for this client. Useful for granting all users at a company a specific role without listing each address individually.
  4. Config default — falls back to ["user"]. Configurable per server.

Rules with client_id="*" match all clients, including direct-flow tokens. This makes it straightforward to grant a global admin role from a single rule without repeating it per client.
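The priority chain reads naturally as a sequence of early returns. The sketch below is an illustration under stated assumptions: the Rule and Resolver types are hypothetical, matching is case-insensitive, and within one level the first matching rule in list order wins (the text does not say how an exact-client rule and a "*" rule at the same level are ranked).

```go
package main

import "strings"

// Rule is a hypothetical RBAC rule. Match holds an email address or a
// domain, depending on which list the rule lives in.
type Rule struct {
	ClientID string // a client ID, or "*" for all clients
	Match    string
	Roles    []string
}

// Resolver holds the four sources of roles, highest priority first.
type Resolver struct {
	Overrides   map[string][]string // key: email + "|" + clientID
	EmailRules  []Rule
	DomainRules []Rule
	Default     []string
}

func matchRule(rules []Rule, clientID, value string) ([]string, bool) {
	for _, r := range rules {
		if (r.ClientID == clientID || r.ClientID == "*") && strings.EqualFold(r.Match, value) {
			return r.Roles, true
		}
	}
	return nil, false
}

// Resolve returns the roles from the first level that matches.
func (r Resolver) Resolve(email, clientID string) []string {
	if roles, ok := r.Overrides[email+"|"+clientID]; ok {
		return roles
	}
	if roles, ok := matchRule(r.EmailRules, clientID, email); ok {
		return roles
	}
	if at := strings.LastIndex(email, "@"); at >= 0 {
		if roles, ok := matchRule(r.DomainRules, clientID, email[at+1:]); ok {
			return roles
		}
	}
	if len(r.Default) > 0 {
		return r.Default
	}
	return []string{"user"}
}
```

Because resolution runs at every token mint, changing any of the four inputs is picked up at the next refresh without touching existing sessions.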

Custom roles can be created per client and optionally set as the default role for first-time logins to that client. This allows each application to define its own role vocabulary while still delegating authentication to a central IdP.

Because roles are embedded in the JWT at mint time, the server needs no separate token introspection call to enforce them. Applications can validate the JWT signature locally using the JWKS endpoint and read roles directly from the roles claim.


SSO Session Sharing

The SSO layer is built on top of the standard authentication flow rather than replacing it. When SSO is enabled globally and a client opts in, the server sets an additional cookie — __idp_session — after successful authentication. This cookie is HttpOnly, SameSite=Lax, and scoped to the IdP domain.

On a subsequent login request to another opted-in client, the server checks whether the submitted email matches the active SSO session. If it does, the server skips the magic link step entirely and proceeds directly to issuing an authorization code. If the emails do not match — because the user wants to switch accounts — the normal flow runs regardless.

This design means the user always has to type their email. There is no invisible automatic sign-in. The ability to switch accounts is always present, and the SSO session can never silently sign in under the wrong identity.

The SSO session is cleared on POST /api/auth/logout, POST /oauth/revoke, and GET /logout. The GET /logout endpoint is designed for cross-origin logout redirects — it clears the SSO cookie and then redirects the browser to the URL specified in the redirect query parameter.


Server Configuration Without Restarts

Runtime configuration (SSO toggle, session TTL, registration policy, allowed redirect domains) is stored in the database rather than in environment variables. This means it can be changed via the API without a container restart. Reads go through a 30-second cache to reduce database load, so a change may take up to 30 seconds to be visible wherever a cached value is still being served.
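A read-through cache with a fixed TTL is enough to get this behaviour. The sketch below is an assumption about the general shape, not magic-auth's code: the ConfigCache type, the string-map representation, and the load callback (standing in for the database query) are all hypothetical.

```go
package main

import (
	"sync"
	"time"
)

// defaultTTL matches the 30-second cache described above.
const defaultTTL = 30 * time.Second

// ConfigCache serves runtime settings from memory and reloads them from
// the backing store once the cached copy is older than ttl.
type ConfigCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	load    func() (map[string]string, error)
	values  map[string]string
	fetched time.Time
}

func NewConfigCache(ttl time.Duration, load func() (map[string]string, error)) *ConfigCache {
	return &ConfigCache{ttl: ttl, load: load}
}

// Get returns the value for key, reloading all settings if the cache is
// empty or stale. On a load error the previous values are kept.
func (c *ConfigCache) Get(key string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.values == nil || time.Since(c.fetched) >= c.ttl {
		v, err := c.load()
		if err != nil {
			return "", err
		}
		c.values, c.fetched = v, time.Now()
	}
	return c.values[key], nil
}
```

If the API handler that writes a setting also resets the cache on the same instance, the change is visible there at once, and other replicas catch up within the TTL.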

Environment variables still handle secrets and infrastructure concerns: signing keys, database DSN, NATS URL, HMAC secrets. These are genuinely startup-time concerns. The distinction is deliberate: operational configuration belongs in the database, secrets belong in environment variables.


The Direct Magic Link Flow

Not every application needs full OIDC. magic-auth also supports a simpler direct flow for apps that just want session cookies managed by the IdP:

POST /api/auth/request    # Submit email, trigger magic link
GET  /api/auth/verify     # User clicks link — cookies are set
GET  /api/auth/me         # Check the current session
POST /api/auth/refresh    # Rotate refresh token
POST /api/auth/logout     # Clear all cookies and SSO session

In this flow the server sets access_token, refresh_id, and refresh_token cookies directly on successful verification. There is no authorization code redirect. This is simpler to integrate for server-rendered applications that do not need portable JWTs — though the cookies are still signed JWTs, just delivered as cookies rather than via the token endpoint.


What This Adds Up To

The architectural picture is a small, auditable server with clearly separated concerns: authentication logic in Go, email delivery decoupled via NATS, storage pluggable between rqlite and Postgres, token signing via standard asymmetric keys, and a full OIDC surface that any compliant client can consume without custom integration work.

None of these are novel ideas individually. The value is in how tightly they fit together in something small enough to understand completely, deploy in minutes, and operate without a dedicated platform team.

The Docker images are on Docker Hub:
API: jlcox1970/magiclink-auth
UI: jlcox1970/magiclink-ui

Setup guide: Building a Passwordless Auth System with magic-auth