The Project
Cloud LLM APIs are convenient until you start weighing three things against each other: recurring cost, data privacy, and the fact that someone else controls the off-switch. For day-to-day coding help and document-heavy reasoning work, I wanted a model that runs on hardware I control — fast enough to be pleasant to use, and shareable with my team without ever touching the public internet.
So I built one: a self-hosted, OpenAI-compatible inference server running a capable open-weight model locally, privately reachable by my team, and running as an always-on service that survives reboots.
This was greenfield. No managed platform, no turnkey installer — every layer was assembled and tuned by hand.
---
What Was Built
The runtime — Rather than the most beginner-friendly option, I built a high-performance, quantization-focused inference engine from source. It's more hands-on, but it's purpose-built for squeezing large quantized models — especially Mixture-of-Experts architectures — onto modest hardware, with fine-grained control over what runs where. Compiling it meant standing up a GPU toolchain and working around a compiler-version mismatch by pointing the build at a compatible host compiler.
The model — A 30-billion-parameter Mixture-of-Experts model. MoE is what makes local inference practical: although the model is large in total, only a small fraction of its parameters are active for any given token, so you get the knowledge capacity of a big model at roughly the per-token compute cost of a small one. The full set of weights still has to live somewhere in memory, which is why quantization and placement matter. I deliberately chose a high-fidelity quantization rather than the most aggressive one; for code and compliance-style reasoning, precision is the whole point.
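To make the memory side of that tradeoff concrete, here is a rough back-of-the-envelope sketch. Only the 30-billion total comes from the project; the bits-per-weight figures and the active-parameter count are illustrative assumptions, not measurements of the model or quantization actually used.

```python
def weights_gib(total_params: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GiB.

    Ignores the KV cache, activations, and runtime overhead.
    """
    return total_params * bits_per_weight / 8 / 2**30


TOTAL_PARAMS = 30e9   # from the text: a 30-billion-parameter model
ACTIVE_PARAMS = 3e9   # ASSUMPTION: about 10% active per token; varies by model

# ASSUMPTION: illustrative bits-per-weight values for two quantization levels.
for label, bpw in [("aggressive, ~4.5 bpw", 4.5), ("high-fidelity, ~8.5 bpw", 8.5)]:
    print(f"{label}: about {weights_gib(TOTAL_PARAMS, bpw):.1f} GiB of weights")

print(f"active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.0%}")
```

The point of the sketch is the shape of the tradeoff: a higher-fidelity quantization costs noticeably more memory, while the per-token compute is governed by the active fraction rather than the total.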
The speed tuning — My first working setup took the easy path: keep the attention layers on the GPU and push every expert layer onto the CPU. It worked, at roughly 24 tokens per second. Then I looked at the GPU utilization and saw the real problem — most of the card was sitting idle. The fix was selective placement: keep the majority of expert layers on the GPU and spill only the overflow to system memory, expressed as a single tensor-placement rule. The result was about 65 tokens per second — roughly 2.7 times faster — with prompt processing also doubling. No new hardware. Just actually using what was already there.
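The placement decision above can be sketched as a small planning calculation: reserve VRAM for attention layers and the KV cache, fill what is left with expert layers, and spill the rest. This is a minimal sketch, not the engine's own logic. The layer count, sizes, and VRAM figures are made-up inputs, and the `blk.N.ffn_*_exps` tensor naming in the generated pattern is an assumption about how the engine names expert tensors.

```python
import re


def plan_expert_placement(n_layers, expert_gib_per_layer, vram_gib, reserved_gib):
    """Return (gpu_layers, cpu_layers) for the expert blocks.

    reserved_gib is VRAM held back for attention weights, KV cache, and scratch.
    """
    if expert_gib_per_layer <= 0:
        raise ValueError("expert_gib_per_layer must be positive")
    budget = max(0.0, vram_gib - reserved_gib)
    n_gpu = min(n_layers, int(budget // expert_gib_per_layer))
    layers = list(range(n_layers))
    return layers[:n_gpu], layers[n_gpu:]


def cpu_override_pattern(cpu_layers):
    """Build one regex matching the expert tensors of the spilled layers.

    HYPOTHETICAL naming: assumes tensors look like 'blk.<N>.ffn_up_exps.weight'.
    """
    if not cpu_layers:
        return None
    ids = "|".join(str(i) for i in cpu_layers)
    return rf"blk\.({ids})\.ffn_.*_exps\."


# ASSUMPTION: illustrative numbers only, not the project's real hardware or model.
gpu, cpu = plan_expert_placement(
    n_layers=48, expert_gib_per_layer=0.5, vram_gib=24.0, reserved_gib=6.0
)
print(len(gpu), "expert layers on GPU,", len(cpu), "spilled to system RAM")
print(cpu_override_pattern(cpu))
```

In practice the reserved figure has to be found by measurement: start conservative, watch GPU memory under a realistic context length, and move layers across until the card is nearly full but not out of room.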
Private sharing — A local model is far more useful when the team can reach it, but the server binds to localhost and has no auth by default. I solved that in two layers. First, an encrypted mesh VPN: every device gets a stable private address reachable from anywhere, with the model's port invisible to the public internet and team members admitted by invitation only. Second, key-based authentication on the server itself. I tested it in both directions: a request with the key succeeds, and a request without it is rejected even from inside the private network. Defense in depth.
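A check like that is easy to script. The sketch below assumes the server exposes the usual OpenAI-style `GET /v1/models` route and accepts the key as an `Authorization: Bearer` header, which is common for OpenAI-compatible servers but worth confirming for the engine in use. The function names and the example address are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Optional


def probe(base_url: str, api_key: Optional[str] = None, timeout: float = 5.0) -> int:
    """GET {base_url}/v1/models and return the HTTP status code."""
    req = urllib.request.Request(f"{base_url}/v1/models")
    if api_key is not None:
        # ASSUMPTION: the server expects an OpenAI-style bearer token.
        req.add_header("Authorization", f"Bearer {api_key}")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is what we want here.
        return err.code


def auth_is_enforced(base_url: str, api_key: str) -> bool:
    """True only if the keyed request succeeds and the unkeyed one is refused."""
    return probe(base_url, api_key) == 200 and probe(base_url) in (401, 403)
```

Run from a teammate's machine on the VPN, something like `auth_is_enforced("http://100.64.0.10:8080", key)` (a placeholder address) should return `True`. A `False` means either the key is wrong or the server is answering unauthenticated requests.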
Always-on service — Finally I wrapped it as a managed system service so it starts on boot, restarts itself if it ever crashes, and runs with no terminal or session attached. The endpoint simply stays up.
---
Skills Demonstrated
- Systems integration — building an inference engine from source, resolving toolchain and compiler-version conflicts, and configuring GPU acceleration end to end
- Performance engineering — diagnosing underutilized hardware and using selective expert-layer placement to lift throughput about 2.7x (roughly 24 to 65 tokens per second)
- Applied ML infrastructure — understanding Mixture-of-Experts tradeoffs and choosing quantization to balance memory, speed, and output quality
- Secure networking — private mesh VPN plus key-based API authentication, layered so neither alone is the only line of defense
- Service reliability — turning a manual process into a self-healing, boot-persistent service with clear operational controls
---
Why It Matters
The default answer to "we need an LLM" is to wire up a cloud API and watch the meter run — while your prompts and documents leave your control. This project shows the alternative is genuinely viable on accessible hardware: a fast, private, team-accessible model that nobody else can rate-limit, deprecate, or read.
The most useful lesson wasn't about any single tool. It was that the biggest performance gain came from measuring — noticing the hardware was idle and rebalancing the workload — rather than from spending more. "Does it fit" is the wrong question for quantized MoE models. "What's the most I can fit while leaving room to work" is the right one.
For the cost of some build time and a couple of evenings of tuning, the result is an AI server owned end to end — private by default, fast enough to reach for daily, and reliable enough to hand to a team.