Siddharth Khatri

Case Study

CoachKai

My own local-first hyperspecialized AI coaching system to help me consistently improve my skills.

I play competitive squash, train seriously, and care about recovery. Every tool I tried gave me generic advice and treated me like a template, not an individual. So I built my own on-device AI system that runs on my hardware, reasons over my actual context, and evolves with my data.

Type: Personal AI System
Domain: Fitness · Recovery · Squash
Model: Local · Open-weight
Status: Active build
CoachKai — local AI coaching interface

The Problem

Every tool I tried knew nothing about me.

I play squash at a competitive level, follow a structured training program, and deal with specific injury history that shapes how I train and recover. Not a single AI tool I tried had any awareness of that. They gave me the same advice they'd give anyone.

My wearables logged everything but couldn't reason over it. Generic LLM chatbots could reason, but without my context they just sounded confident while being useless. I found myself not trusting any of it enough to act on it.

I needed something that actually knew me — my training history, my recovery patterns, my squash game, my goals. Something that ran locally so my data stayed mine. That's why I built CoachKai.

Privacy-first
My data never leaves my machine. Everything runs locally.
Built for me
Squash, fitness, and recovery — shaped around my actual life, not generic wellness.
Context-aware
Reasons over my real history, goals, and injury patterns.
Transparent
I can see exactly what knowledge it uses to answer me.
Why CoachKai was built

Build Journey

From frustration to working system.

01

Identifying My Problem

I got frustrated enough with generic tools to ask myself what I actually needed. I mapped out what a coaching system would have to do to be genuinely useful to me — not in theory, but in my daily training and recovery routine.

02

Evaluating Local Models

I tested a range of open-weight LLMs — Mistral, LLaMA variants, Phi, Gemma — against what I actually needed: good reasoning, solid instruction-following, and domain fit for fitness and squash coaching.

03

Hitting My Hardware Ceiling

My Mac hit VRAM walls almost immediately with 7B+ models. Before I could commit to any architecture decision, I had to face the reality that the hardware I was working with wasn't going to cut it.

04

Upgrading My Setup

I moved to a dedicated Windows machine with 32 GB RAM and an NVIDIA GPU. That unlocked practical inference speeds for the larger models I needed and finally gave me a real experimentation cycle.

05

Testing Locally with My Own Scripts

I wrote my own Python and PowerShell scripts to automate model queries, benchmark responses, and test context injection. Running everything in local terminal environments meant I could iterate fast and see exactly what was happening.
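The benchmarking part of those scripts boiled down to a simple idea: score each model's response against the points a useful answer has to cover. A minimal Python sketch of that idea; the rubric terms and responses below are hypothetical stand-ins, not the real harness:

```python
# Sketch of a keyword-rubric scorer for comparing local model outputs.
# The rubric and responses are hypothetical examples, not real data.

def score_response(response: str, rubric: list[str]) -> float:
    """Fraction of rubric terms a response actually covers."""
    text = response.lower()
    hits = sum(1 for term in rubric if term.lower() in text)
    return hits / len(rubric) if rubric else 0.0

# A recovery query, with the points a useful answer should mention.
rubric = ["hrv", "sleep", "deload"]
resp_a = "Your HRV trend suggests easing off; prioritize sleep this week."
resp_b = "Just train harder and you'll adapt."

print(score_response(resp_a, rubric))  # covers 2 of 3 rubric terms
print(score_response(resp_b, rubric))  # covers none of them
```

Even something this crude made model comparisons repeatable instead of vibes-based.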

06

My First Knowledge Experiments

I started by feeding my own PDFs and fitness books into the retrieval pipeline. The results were frustrating — irrelevant chunks, bloated outputs, almost no domain specificity. I knew I had to rethink the whole approach.

07

Switching to Structured Markdown

I moved first to plain markdown, then to structured long-form markdown with consistent formatting and an explicit schema I defined myself. The difference was immediate — retrieval became precise in a way I hadn't seen before.
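A small script can keep a schema like that honest by flagging files with missing sections. A sketch, assuming a hypothetical three-section schema (the real section names differ):

```python
# Sketch of a schema check: every knowledge file must contain a fixed
# set of top-level sections. The section names here are hypothetical.
REQUIRED_SECTIONS = ["## Context", "## Details", "## Takeaways"]

def missing_sections(markdown_text: str) -> list[str]:
    """Return the required headings absent from a markdown file."""
    lines = {line.strip() for line in markdown_text.splitlines()}
    return [h for h in REQUIRED_SECTIONS if h not in lines]

doc = """# Squash drill log
## Context
Pre-season block, week 2.
## Details
Ghosting sets, 3x10.
"""
print(missing_sections(doc))  # flags the absent '## Takeaways' section
```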

08

Working Through RAG and Architecture

I evaluated different retrieval strategies — naive RAG, chunked embeddings, structured retrieval over markdown. I made deliberate choices about chunk size, embedding model, and retrieval depth based on what actually worked for my use case.
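Aligning chunk boundaries with headings rather than fixed character counts was one of those deliberate choices. A minimal sketch of heading-aware chunking, not the exact implementation:

```python
# Sketch: split structured markdown into one chunk per "## " section,
# so chunk boundaries follow semantic units rather than fixed sizes.

def chunk_by_heading(markdown_text: str) -> list[str]:
    chunks, current = [], []
    for line in markdown_text.splitlines():
        if line.startswith("## ") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]

doc = "## Warm-up\nBike, 10 min.\n## Session\nGhosting, 3x10.\n"
print(chunk_by_heading(doc))  # two chunks, one per section
```

With a consistent schema, every chunk arrives at the embedder as a complete thought, which is what made retrieval depth and chunk-size tuning tractable.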

09

Shaping My MVP

I scoped the MVP deliberately. I left out full context memory because I didn't trust it yet. I wanted a coaching loop I could rely on before I added anything that might undermine it. Getting the core right mattered more than shipping fast.

System Architecture

I designed it for trust, not just capability.

The thing I cared most about was being able to trust what the system told me. So I separated every concern clearly — my knowledge lives in its own vault, retrieval is explicit and I can inspect it, and the model only reasons over what I deliberately give it.

CoachKai system architecture
01
Knowledge Vault

My curated markdown files — training logs, injury history, squash notes, nutrition context. I structured these myself with a consistent schema so retrieval stays precise.

02
Retrieval Layer

Semantic search over my chunked markdown with controlled context injection. I can see exactly what gets retrieved and why — no black box.

03
System Instructions

I defined Kai's persona, constraints, and coaching behaviors in a persistent system prompt. It responds as my coach, not as a generic assistant.

04
Local Model

An open-weight LLM running entirely on my own hardware. No API calls, no cloud, no data leaving my machine. I control the inference completely.

05
Query Interface

A CLI I use to run coaching queries, review sessions, and inspect exactly what context the model received. Built for my workflow, not a general audience.
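The loop behind the CLI (embed the query, retrieve top-k chunks, print the trace, assemble the prompt) can be sketched with toy vectors. The chunks and embeddings below are made-up placeholders; the real system uses a proper embedding model:

```python
import math

# Sketch of the retrieve -> inspect -> assemble loop behind the CLI.
# The chunks and vectors are toy placeholders, not real embeddings.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def retrieve(query_vec, vault, k=2):
    """Top-k chunks by cosine similarity, highest first."""
    scored = sorted(vault, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [(c["text"], round(cosine(query_vec, c["vec"]), 3)) for c in scored[:k]]

vault = [
    {"text": "Injury log: left knee, avoid deep lunges.", "vec": [0.9, 0.1, 0.0]},
    {"text": "Nutrition: pre-session carbs.",             "vec": [0.0, 0.2, 0.9]},
    {"text": "Squash: ghosting drills for movement.",     "vec": [0.7, 0.6, 0.1]},
]

query_vec = [1.0, 0.2, 0.0]  # stands in for an embedded query about knee-safe drills
hits = retrieve(query_vec, vault)
for text, score in hits:       # the inspectable trace: what was retrieved, how strongly
    print(f"{score:.3f}  {text}")

# Only what was retrieved reaches the model, so every answer stays traceable.
prompt = "Context:\n" + "\n".join(t for t, _ in hits) + "\n\nQuestion: knee-safe drills?"
```

Printing the scored trace before the answer is what "no black box" means in practice: a retrieval failure is visible before the model ever responds.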

Hardware & Model Selection

I couldn't pick a model until I faced my hardware limits.

I learned quickly that model selection isn't abstract — it's completely shaped by what hardware you're running on. My Mac wasn't going to cut it. Upgrading my setup changed everything about what I could realistically build.

Before
MacBook — 16 GB unified memory

7B+ models ran slowly or failed to load entirely. Quantization degraded output quality to the point where it wasn't worth running.

After
Windows — 32 GB RAM + NVIDIA GPU

Full Q4/Q5 quants of 13B–34B models at usable speeds. I could finally run proper experiments and actually iterate on the system.

Model · Why it was considered · Where it fell short / stood out · CoachKai decision

Phi-3 Mini · Extremely lightweight and fast · Too small for nuanced coaching and synthesis · Rejected as primary model
Mistral 7B · Strong small-model baseline · Good, but not the best depth for domain reasoning · Useful early benchmark
Llama 3.1 8B · Great instruction following and local practicality · Strong all-rounder, but not the deepest option · Serious contender
Gemma 2 9B · Clean outputs and good reasoning for size · Strong compact option, but not the best final fit · Considered, not selected
Qwen 14B · Best balance of reasoning and deployability · Stronger fit for structured coaching + RAG · Selected
Llama 3.1 70B · Best reasoning quality in the set · Too heavy for the MVP tradeoff profile · Aspirational / post-MVP tier
Model benchmarking

Knowledge + RAG Design

I learned that what I fed it mattered more than the model itself.

I went through three distinct phases building the knowledge layer. Each one taught me something I couldn't have read in a paper — I had to feel the difference in the outputs myself.

Phase 01 — Dropped

PDFs and Books

I started by dumping my own fitness books and PDFs into the retrieval pipeline. The outputs were all over the place — chunks from unrelated sections, verbose answers that missed the point entirely. I kept trying to fix it with prompts. That was the wrong instinct.

Dropped — high noise, low specificity
Phase 02 — Improved

Plain Markdown Files

I converted my knowledge into plain markdown files. Things got noticeably better — cleaner chunks, more relevant outputs. But I hadn't been consistent with how I structured things, and those gaps showed up in the responses.

Improved — better precision, but inconsistent
Phase 03 — Current

Structured Long-Form Markdown

I defined a proper schema for all my knowledge files — standard sections, uniform headings, explicit context blocks. Once I applied it consistently, retrieval became reliable in a way that genuinely surprised me. The coaching outputs finally felt useful.

Current — high precision, consistent outputs

Key insight — I spent too long trying to fix outputs with better prompts. The real problem was always the knowledge structure. Once I got that right, the model didn't need much help at all.

Live Demo

Coaching in action.

This is CoachKai responding to my actual coaching queries — pulling from my structured markdown knowledge base to give me context-aware answers about my training, recovery, and squash game, all running on my own machine.

coachkai — local inference

Full session: my query → retrieval from my knowledge vault → response, running locally on my machine while pulling data from the WHOOP API

Retrospective

What worked for me. What didn't.

What Worked
Structured markdown knowledge

Enforcing a consistent schema across knowledge files — typed headers, explicit date fields, structured workout blocks — produced measurable retrieval improvements. Chunk boundaries aligned with semantic units, which meant the embedding model's top-k results were actually relevant.
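To make "typed headers and explicit date fields" concrete, an entry might look something like this; the field names and layout are illustrative, not the exact schema:

```markdown
## 2025-03-14 · Squash session
type: court-session
load: high

### Workout
- Ghosting: 3x10, 60s rest
- Conditioned games: 20 min

### Recovery notes
Left knee stable; no lunge pain.
```

Because every file shares one shape, a section like "### Recovery notes" chunks, embeds, and retrieves as a complete unit.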

Upgrading my hardware

Moving from 16 GB unified memory to 32 GB RAM + NVIDIA GPU let me run Q4_K_M quants of 13B–34B models at usable inference speeds. The hardware ceiling was directly limiting which model architectures I could test and how fast I could iterate.

Writing an explicit system persona

Hard-coding role constraints, domain boundaries, and response format expectations into the system prompt eliminated a whole class of hallucination. The model stopped confabulating outside its knowledge scope because the prompt structure left no room for it.

Keeping context session-scoped

Injecting only the current session's retrieved chunks — rather than accumulating a rolling context — kept the prompt tight and made outputs traceable. I could inspect exactly which source docs drove each response and catch retrieval failures early.

Saying no to features early

Deferring wearable integrations, voice I/O, and auto-ingestion pipelines until the core retrieve-augment-generate loop was stable prevented compounding failure modes. Each deferred dependency reduced the surface area I had to debug.

What Didn't Work
Dumping raw documents in

Ingesting unstructured PDFs and book chapters produced fragmented chunks that confused the retrieval layer. Cosine similarity scores were high but semantic relevance was low — the embeddings were matching surface tokens, not meaning.

Expanding the context window

Increasing context length didn't improve output quality. With weak retrieval, more context just meant more noise for the model to reason over. Precision in what got retrieved mattered far more than the size of the window.

Inconsistent markdown formatting

Variability in heading depth, list structure, and field naming caused chunking to split semantic units arbitrarily. The retriever was pulling fragments instead of complete thoughts, which degraded the model's ability to synthesize a coherent answer.

Running sub-7B models locally

Models under 7B parameters lacked the instruction-following depth and multi-step reasoning needed for nuanced coaching responses. They'd retrieve correctly but fail to synthesize across multiple chunks — a reasoning gap, not a retrieval gap.

Prompt engineering around bad retrieval

I wasted significant iteration cycles tuning prompts when the root cause was low-quality retrieved context. No amount of prompt engineering compensates for a retriever returning irrelevant chunks — the fix had to happen upstream.

What I'd Build Next

Where I want to take it.

01
Long-term memory layer

I want the system to remember my training history over time — injury patterns, performance trends, what worked and what didn't — so coaching gets smarter the longer I use it.

02
Wearable data integration

My HRV, sleep, and activity data should feed automatically into the knowledge base. I'm still logging manually and it slows the feedback loop.

03
Domain-tuned retrieval

I want an embedding model that understands sports science and fitness terminology specifically — so retrieval gets more precise on the technical stuff.

04
A proper coaching UI

The CLI works for me now, but I want a clean daily interface — quick check-ins, session summaries, recovery recommendations — without needing to think about the system underneath.

05
Response quality evaluation

I want to measure whether the coaching is actually getting better over time. That means building a benchmark and testing against it systematically.

06
Adaptive personalization

Eventually I want the system to pick up on my patterns — how I respond to training load, what my injury warning signs look like — and adjust its coaching without me having to prompt it.

What I Took Away

Building this taught me that the model is the least interesting part.

I went into this thinking the hard part would be picking the right model. I was wrong. The real work was designing the knowledge architecture, making retrieval reliable, defining the right system boundaries, and being honest with myself about scope. That's where the product actually lives.

Retrieval quality is product quality

I learned this the hard way. What gets retrieved is what the model reasons over — getting that right made more difference than any model swap I tried.

Structured knowledge beats document dumps

Volume never helped me. Curated, consistently formatted knowledge files did. I now think of knowledge architecture as a real engineering discipline.

Hardware constraints are product constraints

I couldn't design around my hardware limits — I had to start from them. Abstract benchmarks meant nothing until I ran things on my actual machine.

Trust comes from knowing what it doesn't know

The moments I trusted CoachKai most were when it stayed in its lane. Scope discipline and explicit constraints are what made it feel reliable.

Judgment matters more than capability

A well-scoped, well-structured system with a modest model consistently outperformed a capable model given sloppy context. I've stopped chasing frontier models.