An opinionated tour through model serving, GPU scheduling, autoscaling, observability, and agentic systems — for engineers who already know Kubernetes and want to put LLMs on top of it without making expensive mistakes.
Every lesson begins with a real platform problem — bursty traffic, multi-tenant fine-tunes, agent state, GPU sharing — and walks you to a defensible answer.
Diagrams you can read at a glance. No scrubbing, no transcripts. Each concept sits on its own card, in your hands.
Lessons are 4–7 minutes. Pick up on your phone, finish on your laptop. Progress is yours, not a streak machine.
You will be able to defend the choice of Kubernetes for an LLM workload — and explain what changes when the workload is a 30 GB model.
You will know how to pick a model server, declare it with KServe, and deliver weights without baking them into your image (see the sketch after this list).
You will know how Kubernetes discovers GPUs, when to share them, and how to plan tensor and pipeline parallelism.
You will be able to design an autoscaling, cache-aware, cost-aware inference plane that survives bursty traffic.
You will know which metrics actually matter (TTFT, TPOT, goodput) and how to wire logs, metrics, and traces for streaming workloads.
You will know when to fine-tune, how LoRA changes the serving story, and what gang and topology-aware scheduling buy you.
You will be able to architect a RAG pipeline and a safe agentic system on Kubernetes, with state, identity, and failure domains in mind.
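Here's the shape of that serving answer, as a minimal sketch rather than the course's exact example: a KServe InferenceService that requests one GPU and pulls weights from an OCI registry instead of baking them into the serving image. The name, registry path, and model format are placeholders, and the oci:// storage scheme assumes KServe's modelcar support is enabled in your cluster.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-demo                 # placeholder name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface          # assumes a matching ServingRuntime is installed
      # Weights travel as an OCI artifact pulled at startup, so the serving
      # image stays small and the same image can serve every model version.
      storageUri: oci://registry.example.com/models/llama-8b:v1
      resources:
        requests:
          nvidia.com/gpu: "1"      # advertised by the NVIDIA device plugin
        limits:
          nvidia.com/gpu: "1"      # extended resources: requests must equal limits
```

The model becomes a declared object: scheduling, weight delivery, and GPU placement are all things you can review in a pull request.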
Virtual memory for the GPU KV cache. Pages it into non-contiguous blocks so many sequences share one GPU.
You have CPU-based HPA, no warm pool, 4-minute cold start. What gives first?
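One defensible answer, sketched under stated assumptions: KEDA is installed and Prometheus scrapes a vLLM-style queue-depth metric. The metric name, addresses, and thresholds here are illustrative, not prescribed.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-inference-scaler
spec:
  scaleTargetRef:
    name: llm-inference            # placeholder Deployment name
  minReplicaCount: 1               # the warm pool the scenario is missing
  maxReplicaCount: 8
  cooldownPeriod: 600              # hold capacity; a 4-minute cold start punishes flapping
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: sum(vllm:num_requests_waiting)   # scale on queue depth, not CPU
        threshold: "8"
```

CPU barely moves while tokens stream from the GPU, so a CPU-based HPA scales late or never; queue depth moves first, and a warm minimum absorbs the cold start.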
Pick your model server, autoscaler, GPU strategy, routing, and observability. We'll show where your design holds, where it leaks, and what the book's authors would push back on.
Open the playground
API server, scheduler, controllers — the brains.
Resource requests meet node capacity, with GPU labels.
Queue → batcher → GPU executor → token stream.
Virtual memory for GPU KV cache.
InferenceService over ServingRuntime.
Weights as a sidecar OCI image.
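Two of those cards form one pattern worth seeing together: the ServingRuntime defines how to serve, and the InferenceService sketched earlier says what to serve. A hedged sketch, assuming the vLLM OpenAI-compatible server image; the name, tag, and args are placeholders.

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: vllm-runtime                 # placeholder name
spec:
  supportedModelFormats:
    - name: huggingface
      autoSelect: true               # InferenceServices with this format land here
  containers:
    - name: kserve-container
      image: vllm/vllm-openai:v0.6.0   # placeholder tag
      args: ["--model", "/mnt/models"] # KServe's storage initializer places weights here by convention
      resources:
        limits:
          nvidia.com/gpu: "1"
```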
You should be able to read a card. Then a diagram. Then make a defensible call. That's what this is for.
Start now