Case Study
CoachKai
A local-first, hyperspecialized AI coaching system I built to help me consistently improve my skills.
I play competitive squash, train seriously, and care about recovery. Every tool I tried gave me generic advice and treated me like a template, not an individual. So I built my own on-device AI system that runs on my hardware, reasons over my actual context, and evolves with my data.
The Problem
Every tool I tried knew nothing about me.
I play squash at a competitive level, follow a structured training program, and deal with specific injury history that shapes how I train and recover. Not a single AI tool I tried had any awareness of that. They gave me the same advice they'd give anyone.
My wearables logged everything but couldn't reason over it. Generic LLM chatbots could reason, but without my context they just sounded confident while being useless. I found myself not trusting any of it enough to act on.
I needed something that actually knew me — my training history, my recovery patterns, my squash game, my goals. Something that ran locally so my data stayed mine. That's why I built CoachKai.
Build Journey
From frustration to working system.
Identifying My Problem
I got frustrated enough with generic tools to ask myself what I actually needed. I mapped out what a coaching system would have to do to be genuinely useful to me — not in theory, but in my daily training and recovery routine.
Evaluating Local Models
I tested a range of open-weight LLMs — Mistral, LLaMA variants, Phi, Gemma — against what I actually needed: good reasoning, solid instruction-following, and domain fit for fitness and squash coaching.
Hitting My Hardware Ceiling
My Mac hit VRAM walls almost immediately with 7B+ models. Before I could commit to any architecture decision, I had to accept that the hardware I was working on wasn't going to cut it.
Upgrading My Setup
I moved to a dedicated Windows machine with 32 GB RAM and an NVIDIA GPU. That unlocked practical inference speeds for the larger models I needed and finally gave me a real experimentation cycle.
Testing Locally with My Own Scripts
I wrote my own Python and PowerShell scripts to automate model queries, benchmark responses, and test context injection. Running everything in local terminal environments meant I could iterate fast and see exactly what was happening.
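The harness pattern can be sketched like this; the model call below is a stub standing in for whatever local endpoint you run (llama.cpp, Ollama, or similar), and all the function names are mine for illustration, not part of any library:

```python
import time
from typing import Callable

def benchmark(query_fn: Callable[[str], str], prompts: list[str]) -> list[dict]:
    """Run each prompt through the model and record latency and output size."""
    results = []
    for prompt in prompts:
        start = time.perf_counter()
        answer = query_fn(prompt)
        results.append({
            "prompt": prompt,
            "latency_s": round(time.perf_counter() - start, 3),
            "chars": len(answer),
        })
    return results

# Stand-in for a real local call (e.g. an HTTP request to a llama.cpp
# or Ollama server); swap in your own query function here.
def fake_model(prompt: str) -> str:
    return f"Coaching answer for: {prompt}"

report = benchmark(fake_model, ["How hard should I train today?"])
print(report[0]["latency_s"], report[0]["chars"])
```

Keeping the query function injectable means the same harness can compare models, quantizations, and context-injection strategies without changing the benchmarking code.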
My First Knowledge Experiments
I started by feeding my own PDFs and fitness books into the retrieval pipeline. The results were frustrating — irrelevant chunks, bloated outputs, almost no domain specificity. I knew I had to rethink the whole approach.
Switching to Structured Markdown
I moved first to plain markdown, then to structured long-form markdown with consistent formatting and explicit schema I defined myself. The difference was immediate — retrieval became precise in a way I hadn't seen before.
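A minimal sketch of why the structure helps: chunking on headings keeps each section intact, so a chunk boundary is also a semantic boundary. The headings and document here are illustrative, not my actual schema:

```python
def chunk_markdown(text: str) -> list[dict]:
    """Split a markdown doc into one chunk per '## ' section,
    keeping the heading with its body so each chunk is self-contained."""
    chunks = []
    current = {"heading": None, "lines": []}
    for line in text.splitlines():
        if line.startswith("## "):
            chunks.append(current)
            current = {"heading": line[3:].strip(), "lines": []}
        else:
            current["lines"].append(line)
    chunks.append(current)
    # Drop the preamble before the first section heading.
    return [
        {"heading": c["heading"], "body": "\n".join(c["lines"]).strip()}
        for c in chunks
        if c["heading"] is not None
    ]

doc = """# Session Log
## Training
Court sprints, 3 sets.
## Recovery
Slept 8 h, low strain day.
"""
for c in chunk_markdown(doc):
    print(c["heading"], "->", c["body"])
```

With consistent headings, every chunk the retriever returns is a complete thought rather than an arbitrary slice of text.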
Working Through RAG and Architecture
I evaluated different retrieval strategies — naive RAG, chunked embeddings, structured retrieval over markdown. I made deliberate choices about chunk size, embedding model, and retrieval depth based on what actually worked for my use case.
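The retrieval-depth decision ultimately comes down to ranking chunks by similarity to the query embedding and keeping the top k. A toy sketch with hand-made vectors — real embeddings would come from an embedding model:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunks: list[dict], k: int = 2) -> list[dict]:
    """Rank chunks by cosine similarity to the query embedding."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:k]

# Toy 3-dimensional embeddings; chunk ids are hypothetical.
chunks = [
    {"id": "recovery-notes", "vec": [0.9, 0.1, 0.0]},
    {"id": "squash-drills",  "vec": [0.1, 0.9, 0.2]},
    {"id": "meal-plans",     "vec": [0.0, 0.2, 0.9]},
]
hits = top_k([0.8, 0.2, 0.1], chunks, k=2)
print([h["id"] for h in hits])  # -> ['recovery-notes', 'squash-drills']
```

The knobs the text mentions — chunk size, embedding model, retrieval depth — all live in this loop: what goes into `chunks`, what produces the vectors, and what k is.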
Shaping My MVP
I scoped the MVP deliberately. I left out full context memory because I didn't trust it yet. I wanted a coaching loop I could rely on before I added anything that might undermine it. Getting the core right mattered more than shipping fast.
System Architecture
I designed it for trust, not just capability.
The thing I cared most about was being able to trust what the system told me. So I separated every concern clearly — my knowledge lives in its own vault, retrieval is explicit and I can inspect it, and the model only reasons over what I deliberately give it.
Hardware & Model Selection
I couldn't pick a model until I faced my hardware limits.
I learned quickly that model selection isn't abstract — it's completely shaped by what hardware you're running on. My Mac wasn't going to cut it. Upgrading my setup changed everything about what I could realistically build.
Knowledge + RAG Design
I learned that what I fed it mattered more than the model itself.
I went through three distinct phases building the knowledge layer. Each one taught me something I couldn't have read in a paper — I had to feel the difference in the outputs myself.
PDFs and Books
I started by dumping my own fitness books and PDFs into the retrieval pipeline. The outputs were all over the place — chunks from unrelated sections, verbose answers that missed the point entirely. I kept trying to fix it with prompts. That was the wrong instinct.
Plain Markdown Files
I converted my knowledge into plain markdown files. Things got noticeably better — cleaner chunks, more relevant outputs. But I hadn't been consistent with how I structured things, and those gaps showed up in the responses.
Structured Long-Form Markdown
I defined a proper schema for all my knowledge files — standard sections, uniform headings, explicit context blocks. Once I applied it consistently, retrieval became reliable in a way that genuinely surprised me. The coaching outputs finally felt useful.
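Enforcing a schema like that is easy to automate. A sketch of a validator, with placeholder section names rather than my real ones:

```python
# Illustrative required sections, not the actual CoachKai schema.
REQUIRED_SECTIONS = ["Context", "Content", "Coaching Notes"]

def validate_knowledge_file(text: str) -> list[str]:
    """Return the schema sections a knowledge file is missing."""
    headings = {
        line[3:].strip()
        for line in text.splitlines()
        if line.startswith("## ")
    }
    return [s for s in REQUIRED_SECTIONS if s not in headings]

doc = "# Drop Shot Technique\n## Context\n...\n## Content\n..."
print(validate_knowledge_file(doc))  # -> ['Coaching Notes']
```

Running a check like this over the whole vault catches the inconsistencies before they show up as bad retrievals.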
Key insight — I spent too long trying to fix outputs with better prompts. The real problem was always the knowledge structure. Once I got that right, the model didn't need much help at all.
Live Demo
Coaching in action.
This is CoachKai responding to my actual coaching queries — pulling from my structured markdown knowledge base to give me context-aware answers about my training, recovery, and squash game, all running on my own machine.
Full session: my query → retrieval from my knowledge vault → response, running locally on my machine while pulling data from the WHOOP API
Retrospective
What worked for me. What didn't.
Enforcing a consistent schema across knowledge files — typed headers, explicit date fields, structured workout blocks — produced measurable retrieval improvements. The embedding model chunk boundaries aligned with semantic units, which meant top-k results were actually relevant.
Moving from 16 GB unified memory to 32 GB RAM + NVIDIA GPU let me run Q4_K_M quants of 13B–34B models at usable inference speeds. The hardware ceiling was directly limiting which model architectures I could test and how fast I could iterate.
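The arithmetic behind the upgrade is simple. As a rough rule of thumb, Q4_K_M weights cost a little over half a byte per parameter, plus runtime overhead for the KV cache and buffers; the constants below are ballpark estimates, not exact figures:

```python
def approx_memory_gb(params_b: float,
                     bytes_per_param: float = 0.56,
                     overhead_gb: float = 1.5) -> float:
    """Rough memory footprint for a quantized model: weights plus overhead.
    0.56 bytes/param is a ballpark for Q4_K_M; treat every number as an estimate."""
    return round(params_b * bytes_per_param + overhead_gb, 1)

for size in (7, 13, 34):
    print(f"{size}B ~ {approx_memory_gb(size)} GB")
# -> 7B ~ 5.4 GB, 13B ~ 8.8 GB, 34B ~ 20.5 GB
```

On those numbers, a 16 GB unified-memory Mac runs out of headroom somewhere in the low teens of parameters, while 32 GB plus a discrete GPU comfortably covers the 13B–34B range.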
Hard-coding role constraints, domain boundaries, and response format expectations into the system prompt eliminated a whole class of hallucination. The model stopped confabulating outside its knowledge scope because the prompt structure left no room for it.
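The shape of such a system prompt looks roughly like this; the wording is illustrative, not my exact prompt:

```python
# Illustrative constraint prompt, not the production CoachKai prompt.
SYSTEM_PROMPT = """You are a personal squash and fitness coach.
Constraints:
- Answer ONLY from the provided context blocks.
- If the context does not cover the question, say so and stop.
- Respond as: Assessment, Recommendation, Caution.
Domain: training, recovery, squash technique. Nothing else."""

def build_messages(context: str, question: str) -> list[dict]:
    """Assemble a chat payload with the constraints baked into the system role."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

msgs = build_messages("Slept 6 h, high strain yesterday.", "Should I do intervals today?")
print(msgs[0]["role"], "+", msgs[1]["role"])
```

The point is that the refusal path ("say so and stop") is part of the contract, so the model has an approved way out instead of confabulating.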
Injecting only the current session's retrieved chunks — rather than accumulating a rolling context — kept the prompt tight and made outputs traceable. I could inspect exactly which source docs drove each response and catch retrieval failures early.
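Session-only injection can be sketched like this: each prompt is built fresh from this session's retrieved chunks, each tagged with its source path so every answer is traceable. The file names are hypothetical:

```python
def assemble_prompt(retrieved: list[dict], question: str) -> tuple[str, list[str]]:
    """Inject only this session's retrieved chunks, tagged with their source file,
    so every response can be traced back to the docs that produced it."""
    sources = [c["source"] for c in retrieved]
    context = "\n\n".join(f"[{c['source']}]\n{c['text']}" for c in retrieved)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return prompt, sources

prompt, sources = assemble_prompt(
    [{"source": "recovery/2024-03-sleep.md",  # hypothetical vault path
      "text": "Avg sleep 6.2 h this week."}],
    "Should I push today?",
)
print(sources)
```

Logging `sources` alongside each response is what makes retrieval failures visible: a wrong answer can be traced to a wrong chunk, not just shrugged at.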
Deferring wearable integrations, voice I/O, and auto-ingestion pipelines until the core retrieve-augment-generate loop was stable prevented compounding failure modes. Each deferred dependency reduced the surface area I had to debug.
Ingesting unstructured PDFs and book chapters produced fragmented chunks that confused the retrieval layer. Cosine similarity scores were high but semantic relevance was low — the embeddings were matching surface tokens, not meaning.
Increasing context length didn't improve output quality. With weak retrieval, more context just meant more noise for the model to reason over. Precision in what got retrieved mattered far more than the size of the window.
Variability in heading depth, list structure, and field naming caused chunking to split semantic units arbitrarily. The retriever was pulling fragments instead of complete thoughts, which degraded the model's ability to synthesise a coherent answer.
Models under 7B parameters lacked the instruction-following depth and multi-step reasoning needed for nuanced coaching responses. They'd retrieve correctly but fail to synthesise across multiple chunks — a reasoning gap, not a retrieval gap.
I wasted significant iteration cycles tuning prompts when the root cause was low-quality retrieved context. No amount of prompt engineering compensates for a retriever returning irrelevant chunks — the fix had to happen upstream.
What I'd Build Next
Where I want to take it.
What I Took Away
Building this taught me that the model is the least interesting part.
I went into this thinking the hard part would be picking the right model. I was wrong. The real work was designing the knowledge architecture, making retrieval reliable, defining the right system boundaries, and being honest with myself about scope. That's where the product actually lives.
