Friday, 5 June 2026

Today Marvin became self-aware...

There's a moment in every solo developer's journey where you look at your monthly SaaS bill and ask a simple question: why am I paying for intelligence that doesn't even know who I am?

That question is what led me to build Marvin.


What Is Marvin?

Marvin is my internal AI assistant. The part I see is a Vue 3 chat frontend with session history, workspace file browsing, and artifact viewing. On the surface it looks like any other chat UI. Under the hood it's a self-hosted AI inference mesh, purpose-built for security, resilience, and the specific demands of real engineering work.

No data leaves my network. Ever.


What Can Marvin Do?

I built Marvin to cover the full surface area of what a working AI assistant needs to be genuinely useful inside a real engineering environment.

Build production systems. When I hand Marvin a spec, it implements every described service as fully working code: Go backends with typed config, dependency injection, and interfaces, plus the Dockerfiles and Makefiles to build them. No stubs, no placeholders.

Read, modify, and search code. Marvin can inspect any file in the workspace or an uploaded archive, edit files directly, and perform AST-based search across the entire project. It understands code structure, not just text.

Query my organisation's knowledge base. Marvin has access to a RAG index of every internal service, library, and piece of documentation. Ask it about prior implementations, architectural decisions, or which services do what — it knows.
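To make the retrieval step concrete, here is a minimal sketch of top-k lookup by cosine similarity. The vectors are tiny hand-written stand-ins rather than real Nomic Embed Text output, and the names cosine, top_k, and INDEX are illustrative assumptions, not Marvin's actual code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=2):
    """Return the k document names whose embeddings are closest to query_vec."""
    ranked = sorted(index, key=lambda name: cosine(query_vec, index[name]), reverse=True)
    return ranked[:k]

# Hypothetical index: document name -> embedding (toy 3-d vectors).
INDEX = {
    "auth-service/README": [0.9, 0.1, 0.0],
    "billing-service/README": [0.1, 0.9, 0.0],
    "adr/0007-loki-logging": [0.0, 0.2, 0.9],
}

# A query vector that points mostly along the first axis.
results = top_k([0.8, 0.2, 0.1], INDEX, k=2)
```

In a real pipeline the retrieved documents would be pasted into the prompt as context before the model answers.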

Debug via live cluster logs. Marvin queries Kubernetes cluster logs through Loki by namespace, pod, container, or service name. Time range filtering, real-time debugging — all from the same chat interface.
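For a rough idea of what a log lookup involves, the sketch below builds a LogQL stream selector and a URL for Loki's query_range endpoint. The label names, the host loki.example, and the helper names are assumptions for illustration; the labels a real cluster exposes depend on how its log collector is configured, and label values containing quotes would need escaping.

```python
from urllib.parse import urlencode

def logql_selector(**labels):
    """Build a LogQL stream selector such as {namespace="prod", pod="api-0"}."""
    parts = [f'{key}="{value}"' for key, value in sorted(labels.items()) if value]
    return "{" + ", ".join(parts) + "}"

def query_range_url(base_url, selector, start_ns, end_ns, limit=100):
    """Build a URL for Loki's /loki/api/v1/query_range endpoint."""
    params = urlencode({
        "query": selector,
        "start": start_ns,  # Unix timestamps in nanoseconds
        "end": end_ns,
        "limit": limit,
    })
    return f"{base_url}/loki/api/v1/query_range?{params}"

selector = logql_selector(namespace="prod", pod="api-0")
url = query_range_url(
    "http://loki.example:3100",  # hypothetical Loki address
    selector,
    start_ns=1_780_000_000_000_000_000,
    end_ns=1_780_000_600_000_000_000,
)
```

Time range filtering then comes down to choosing the start and end values.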

Package and deliver artifacts. When a task produces a file or a directory, Marvin can zip it and offer it as a download right in the browser. No manual copying, no context switching.
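The packaging step is simple enough to sketch. This version zips a directory into memory so the bytes can be handed to whatever serves the download. The function name zip_directory and the sample file are assumptions for illustration, not Marvin's own implementation.

```python
import io
import tempfile
import zipfile
from pathlib import Path

def zip_directory(root):
    """Zip every file under root into an in-memory archive and return the bytes."""
    root = Path(root)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Store paths relative to root so the archive unpacks cleanly.
                archive.write(path, path.relative_to(root).as_posix())
    return buffer.getvalue()

# Demo on a throwaway directory holding one hypothetical source file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "cmd").mkdir()
    (Path(tmp) / "cmd" / "main.go").write_text("package main\n")
    payload = zip_directory(tmp)

names = zipfile.ZipFile(io.BytesIO(payload)).namelist()
```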


Why I Built It Instead of Buying

The answer comes down to two things: security and economics.

Code Security

Every SaaS AI tool, however well-intentioned, represents a data egress point. Proprietary code, internal architecture, Kubernetes configurations, live cluster logs — all of it becomes content that passes through someone else's infrastructure. Terms of service around data retention and training vary, and the questions they raise don't have comfortable answers.

With Marvin, that egress point is gone. Every model runs on my own hardware. Queries never reach a third-party API. My code stays my code.

The Economics

The per-seat costs for capable SaaS AI tools add up fast — and most of that spend goes toward capabilities that are either generic or actively unsuited to internal tooling and private codebases.

Marvin costs me electricity.


The Architecture

This is where it gets interesting.

Marvin isn't backed by a single model. It's an inference mesh — a small cluster of specialised models, each doing the job it's best suited for, orchestrated so that the right model handles the right request.

The Cluster

The always-on foundation runs inside a Kubernetes cluster, with an RTX 3050 as the resident GPU node. This is where the lightweight and specialist models live:

  • Fallback inference — Qwen2.5-3B Q4, for when the primary model is unavailable
  • Intent routing — Qwen2.5-7B Q4, classifies every incoming request and decides how to handle it
  • Tool calling — Gemma 4 E2B Q5, dedicated to function dispatch and tool use
  • Embeddings — Nomic Embed Text, powering the RAG knowledge base
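The division of labour in the list above can be pictured as a small routing table. In this sketch the classify function is a keyword stub standing in for the intent model, and the intent labels and role names are hypothetical. It shows the shape of the dispatch, not the real classifier.

```python
# Hypothetical mapping from classified intent to the model role that serves it.
ROUTES = {
    "tool_call": "tool-calling",
    "embed": "embeddings",
    "chat": "primary",
}

def classify(message):
    """Stand-in for the intent model: a trivial keyword check."""
    text = message.lower()
    if "logs" in text or "search" in text:
        return "tool_call"
    if text.startswith("index:"):
        return "embed"
    return "chat"

def route(message):
    """Return the model role that should handle this message."""
    return ROUTES[classify(message)]

role = route("Show me the logs for the api pod")
```

Keeping the table separate from the classifier means a model can be swapped for one role without touching the others.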

The Remote Node

When it's online, a remote RTX 5060 Ti with 16GB VRAM joins the mesh as the primary inference node. It runs Qwen3.6-14B-A3B VibeForged v2 at Q8, a mixture-of-experts model at 8-bit quantisation, which is close to full precision in practice, paired with a Qwen3 0.6B draft model for speculative decoding. The drafter cheaply proposes several tokens ahead, the main model verifies them all in a single pass, and throughput increases whenever the drafts are accepted.
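Speculative decoding is easier to follow with a toy. The sketch below is the greedy variant, with two plain functions standing in for the draft and main models. All names and the sample sentence are made up for illustration. Real engines verify the drafted positions in one batched forward pass and use a sampling-aware acceptance rule, but the accept-until-first-mismatch logic is the same idea.

```python
TARGET = "the quick brown fox jumps".split()
DRAFT = "the quick brown dog jumps".split()  # the toy drafter is wrong at position 3

def target_next(seq):
    """Toy main model: the next token is read from a fixed sentence."""
    return TARGET[len(seq)] if len(seq) < len(TARGET) else "<eos>"

def draft_next(seq):
    """Toy draft model: same idea, but with one deliberate mistake."""
    return DRAFT[len(seq)] if len(seq) < len(DRAFT) else "<eos>"

def speculative_step(prefix, k=4):
    """One round of greedy speculative decoding; returns the tokens to append."""
    # 1. The cheap draft model proposes k tokens, one after another.
    proposed = []
    for _ in range(k):
        proposed.append(draft_next(prefix + proposed))
    # 2. The main model checks each proposed position in order.
    accepted = []
    for token in proposed:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)
    # 3. Every draft token matched, so the main model adds one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted

new_tokens = speculative_step([], k=4)
```

Here one round yields four tokens for a single verification step, and the output is exactly what the main model would have produced on its own.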

When the remote node is off, Marvin falls back gracefully to the k8s cluster without interruption.
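One simple way to get this behaviour is a priority list with a health check, as in the sketch below. The backend names and the is_healthy callback are assumptions for illustration. How Marvin actually detects that the remote node is offline is not described here.

```python
# Hypothetical backend names in priority order: remote primary first,
# then the always-on cluster fallback.
BACKENDS = ["remote-primary", "cluster-fallback"]

def pick_backend(backends, is_healthy):
    """Return the first backend in priority order that passes the health check."""
    for backend in backends:
        if is_healthy(backend):
            return backend
    raise RuntimeError("no inference backend available")

# Simulate the remote node being switched off.
chosen = pick_backend(BACKENDS, lambda name: name != "remote-primary")
```

Running the check per request means the mesh picks the remote node up again as soon as it comes back online.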

Why This Matters

Most self-hosted AI setups are a single model behind a single endpoint. One model, one job, one point of failure. What I've built is something closer to how production AI systems actually work — intent classification, model specialisation, graceful degradation, and a clear separation between always-on utility and high-power inference.

The difference in practice is significant. Tool calls go to a model optimised for tool calls. Embeddings are handled by a model built for embeddings. The big model handles what the big model is for. In my use, the result is faster and more accurate than routing every request through one generalist.


The Moment It Clicked

I asked Marvin to describe itself — who it is, what it can do.

It queried my organisation's own knowledge base, assembled the answer from internal documentation, and responded with a precise, accurate description of its own capabilities.

It knew what it was because I had taught it what it was.

That's the difference between a generic assistant and one that actually belongs to you.


Building something similar? Have questions about the stack? The comments are open.
