gen-ai · k8s
Kube + LLM, made readable
Learn
Visual microlearning

Generative AI on Kubernetes, made readable.

An opinionated tour through model serving, GPU scheduling, autoscaling, observability, and agentic systems — for engineers who already know Kubernetes and want to put LLMs on top of it without making expensive mistakes.

Start your path · Browse concepts
7 paths · 21 lessons · 78m total
Lesson 03 · Disaggregated serving · 4/9
Disaggregated prefill + decode: user prompt (2k tokens in) → prefill workers (compute-bound) → KV blocks → decode workers (memory-bound) → user (stream).

Why split: different SKUs win (GPU shapes per phase) · independent autoscaling (prefill scales on input tokens) · better goodput (phases stop interfering).

Cost: cross-pod KV transfer adds latency and complexity. Worth it when prompts are long, outputs are short, or one phase dominates.
Takeaway
Prefill is compute-bound. Decode is memory-bound. Don't scale them on the same signal.
TTFT p95
1.18s
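
To make the takeaway concrete: a minimal sketch of scaling the two phases on different signals, assuming prefill and decode run as separate Deployments and a metrics adapter (e.g. prometheus-adapter) exposes the per-pod metrics named here. Both metric names and all thresholds are illustrative, not prescribed by the book.

# Prefill: scale on input-token pressure (compute-bound).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prefill-workers
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prefill
  minReplicas: 2
  maxReplicas: 16
  metrics:
    - type: Pods
      pods:
        metric:
          name: prompt_tokens_per_second   # hypothetical, via a metrics adapter
        target:
          type: AverageValue
          averageValue: "8000"
---
# Decode: scale on KV-cache pressure (memory-bound).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: decode-workers
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: decode
  minReplicas: 2
  maxReplicas: 24
  metrics:
    - type: Pods
      pods:
        metric:
          name: kv_cache_utilization       # hypothetical, percent in use
        target:
          type: AverageValue
          averageValue: "80"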

Built around the work

Every lesson begins with a real platform problem — bursty traffic, multi-tenant fine-tunes, agent state, GPU sharing — and walks you to a defensible answer.

Visual, not video

Diagrams you can read at a glance. No scrubbing, no transcripts. Each concept sits on its own card, in your hands.

Five minutes at a time

Lessons are 4–7 minutes. Pick up on your phone, finish on your laptop. Progress is yours, not a streak machine.

Seven paths

From "why Kubernetes" to "ship an agent"

See all paths
  • Path 01 · Foundation

    Why Kubernetes for generative AI

    You will be able to defend the choice of Kubernetes for an LLM workload — and explain what changes when the workload is a 30 GB model.

    3 lessons
  • Path 02 · Practical

    Model serving on Kubernetes

    You will know how to pick a model server, declare it with KServe, and deliver weights without baking them into your image.

    3 lessons
  • Path 03 · Practical

    GPU scheduling and resource management

    You will know how Kubernetes discovers GPUs, when to share them, and how to plan tensor and pipeline parallelism.

    3 lessons
  • Path 04 · Advanced

    Scaling, routing, and disaggregated serving

    You will be able to design an autoscaling, cache-aware, cost-aware inference plane that survives bursty traffic.

    3 lessons
  • Path 05 · Practical

    Observability for LLM systems

    You will know which metrics actually matter (TTFT, TPOT, goodput) and how to wire logs, metrics, and traces for streaming workloads; a recording-rule sketch follows this list.

    3 lessons
  • Path 06 · Advanced

    Tuning at scale: LoRA and HPC scheduling

    You will know when to fine-tune, how LoRA changes the serving story, and what gang and topology-aware scheduling buy you.

    3 lessons
  • Path 07 · Advanced

    AI-driven apps: RAG and agents

    You will be able to architect a RAG pipeline and a safe agentic system on Kubernetes, with state, identity, and failure domains in mind.

    3 lessons
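
The Path 05 metrics in practice: a minimal recording-rule sketch for TTFT and TPOT p95, assuming the serving pods are scraped and expose vLLM's Prometheus histograms (vllm:time_to_first_token_seconds, vllm:time_per_output_token_seconds). The rule and group names are illustrative.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llm-latency          # illustrative
spec:
  groups:
    - name: llm.latency
      rules:
        # TTFT p95 across all serving pods, 5m window
        - record: job:ttft_seconds:p95
          expr: |
            histogram_quantile(0.95,
              sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))
        # TPOT p95: time per output token while decoding
        - record: job:tpot_seconds:p95
          expr: |
            histogram_quantile(0.95,
              sum by (le) (rate(vllm:time_per_output_token_seconds_bucket[5m])))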
What it looks like

Ten card types. Each does one thing well.

Memory
PagedAttention

Virtual memory for the GPU KV cache. Pages it into non-contiguous blocks so many sequences share one GPU.

Compare
Tensor vs. pipeline parallel
  • Comms: NVLink-heavy
  • Span: Single-node
  • Latency: Lower TTFT
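In practice the tensor-parallel column above is a single flag. A container-spec fragment, assuming vLLM on a 4-GPU node; the model name is illustrative.

# Tensor parallel: shard every layer across 4 NVLinked GPUs in one node.
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - "--model"
      - "meta-llama/Llama-2-13b-chat-hf"   # illustrative model
      - "--tensor-parallel-size"
      - "4"   # NVLink-heavy comms, single-node span, lower TTFT
      # For spans beyond one node, vLLM also accepts --pipeline-parallel-size.
    resources:
      limits:
        nvidia.com/gpu: 4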
Scenario
A burst doubles in 30s

You have CPU-based HPA, no warm pool, 4-minute cold start. What gives first?

TTFT goes
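One defensible fix: stop scaling on CPU, scale on request concurrency, and keep a warm pool so the 4-minute cold start never sits on the request path. A minimal Knative sketch; the service name and target value are illustrative.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: chat-13b   # illustrative
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/target: "4"      # in-flight requests per pod
        autoscaling.knative.dev/min-scale: "2"   # warm pool: never scale to zero
    spec:
      containers:
        - image: vllm/vllm-openai:latest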
Architecture playground

Compose a real LLM platform from primitives.

Pick model server, autoscaler, GPU strategy, routing, and observability. We'll show where your design holds, where it leaks, and what the book's author would push back on.

Open the playground
Bursty LLM inference endpoint
Serve a 13B chat model behind a public API. Traffic doubles in 30s twice a day. Cold start is 4 minutes.
Multi-tenant fine-tuned serving
20 tenants, each with their own LoRA adapter on the same 7B base (see the sketch after these scenarios).
GPU sharing for a mixed workload
A shared GPU cluster supports both production inference and best-effort tuning experiments.
Detecting a quality regression
You are about to roll out a new fine-tune. Infra metrics are green.
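For the multi-tenant scenario above, vLLM can serve many adapters on one base model. A container-args sketch; adapter names and paths are illustrative.

# One 7B base, many resident LoRA adapters.
args:
  - "--model"
  - "meta-llama/Llama-2-7b-hf"        # the shared base
  - "--enable-lora"
  - "--max-loras"
  - "4"                               # adapters resident on the GPU at once
  - "--lora-modules"
  - "tenant-a=/adapters/tenant-a"     # illustrative tenant adapters
  - "tenant-b=/adapters/tenant-b"

A request then selects its adapter by sending the adapter name as the model field, so twenty tenants share one set of base weights.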
Layers: InferenceService (your intent: model: my-llama, runtime: vllm-h100, scaling: knative) → ServingRuntime (the engine template: image: vllm/vllm-openai, args, accelerators) → Deployment + Service (rendered pods) → Storage initializer (pulls weights) → Scaler (Knative or HPA).

Why this is good: one template, many models · promote new models without rebuilding the runtime · declarative, GitOps-friendly · the control loop ships pods so you don't have to.
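
The same layers as manifests: a minimal sketch using the names from the diagram; the registry path is illustrative.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime            # the engine template, owned by the platform team
metadata:
  name: vllm-h100
spec:
  supportedModelFormats:
    - name: huggingface
  containers:
    - name: kserve-container
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/gpu: 1
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService          # your intent, owned by the model team
metadata:
  name: my-llama
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      runtime: vllm-h100
      # Modelcar-style weights: pulled as an OCI image, not baked into the runtime.
      storageUri: oci://registry.example.com/models/my-llama:1.0   # illustrative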
Concept library

Twenty-two ideas, one diagram each.

Open the library
  • architecture
    Kubernetes control plane

    API server, scheduler, controllers — the brains.

  • gpu
    Pod scheduling for AI

    Resource requests meet node capacity, with GPU labels.

  • model-serving
    Model server

    Queue → batcher → GPU executor → token stream.

  • model-serving
    PagedAttention

    Virtual memory for GPU KV cache.

  • model-serving
    KServe stack

    InferenceService over ServingRuntime.

  • model-serving
    Modelcars

    Weights as a sidecar OCI image.

The point

You shouldn't have to read a 400-page book to ship a model.

You should be able to read a card. Then a diagram. Then make a defensible call. That's what this is for.

Start now
Distilled from public material on running generative AI workloads on Kubernetes.
maazghani/genai-on-k8s.dev

Content adapted from Generative AI on Kubernetes by Roland Huß. This visual guide is made freely available through Red Hat's sponsorship of the digital edition.