gen-ai · k8s
Kube + LLM, made readable
Learn
Visual microlearning

Generative AI on Kubernetes, made readable.

An opinionated tour through model serving, GPU scheduling, autoscaling, observability, and agentic systems — for engineers who already know Kubernetes and want to put LLMs on top of it without making expensive mistakes.

Start your path · Browse concepts
7 paths · 21 lessons · 78m total
Lesson 03 · Disaggregated serving · 4/9
Disaggregated prefill + decode: user prompt (2k tokens in) → prefill workers (compute-bound) → KV blocks → decode workers (memory-bound) → user (stream).

Why split: different SKUs win (GPU shapes per phase) · independent autoscaling (prefill scales on input tokens) · better goodput (phases stop interfering).

Cost: cross-pod KV transfer adds latency and complexity. Worth it when prompts are long, outputs are short, or one phase dominates.
Takeaway
Prefill is compute-bound. Decode is memory-bound. Don't scale them on the same signal.
TTFT p95
1.18s
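
To make the takeaway concrete: a minimal sketch of scaling the two phases on different signals, assuming prefill and decode run as separate Deployments and a metrics adapter (e.g. prometheus-adapter) exposes the per-pod metrics named here. Both metric names and all thresholds are illustrative, not prescribed by the book.

# Prefill: scale on input-token pressure (compute-bound).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prefill-workers
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prefill
  minReplicas: 2
  maxReplicas: 16
  metrics:
    - type: Pods
      pods:
        metric:
          name: prompt_tokens_per_second   # hypothetical, via a metrics adapter
        target:
          type: AverageValue
          averageValue: "8000"
---
# Decode: scale on KV-cache pressure (memory-bound).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: decode-workers
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: decode
  minReplicas: 2
  maxReplicas: 24
  metrics:
    - type: Pods
      pods:
        metric:
          name: kv_cache_utilization       # hypothetical, percent in use
        target:
          type: AverageValue
          averageValue: "80"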

Built around the work

Every lesson begins with a real platform problem — bursty traffic, multi-tenant fine-tunes, agent state, GPU sharing — and walks you to a defensible answer.

Visual, not video

Diagrams you can read at a glance. No scrubbing, no transcripts. Each concept sits on its own card, in your hands.

Five minutes at a time

Lessons are 4–7 minutes. Pick up on your phone, finish on your laptop. Progress is yours, not a streak machine.

Seven paths

From "why Kubernetes" to "ship an agent"

See all paths
  • Path 01 · Foundation

    Why Kubernetes for generative AI

    You will be able to defend the choice of Kubernetes for an LLM workload — and explain what changes when the workload is a 30 GB model.

    3 lessons
  • Path 02 · Practical

    Model serving on Kubernetes

    You will know how to pick a model server, declare it with KServe, and deliver weights without baking them into your image.

    3 lessons
  • Path 03 · Practical

    GPU scheduling and resource management

    You will know how Kubernetes discovers GPUs, when to share them, and how to plan tensor and pipeline parallelism.

    3 lessons
  • Path 04 · Advanced

    Scaling, routing, and disaggregated serving

    You will be able to design an autoscaling, cache-aware, cost-aware inference plane that survives bursty traffic.

    3 lessons
  • Path 05 · Practical

    Observability for LLM systems

    You will know which metrics actually matter (TTFT, TPOT, goodput) and how to wire logs, metrics, and traces for streaming workloads; a recording-rule sketch follows this list.

    3 lessons
  • Path 06 · Advanced

    Tuning at scale: LoRA and HPC scheduling

    You will know when to fine-tune, how LoRA changes the serving story, and what gang and topology-aware scheduling buy you.

    3 lessons
  • Path 07 · Advanced

    AI-driven apps: RAG and agents

    You will be able to architect a RAG pipeline and a safe agentic system on Kubernetes, with state, identity, and failure domains in mind.

    3 lessons
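
The Path 05 metrics in practice: a minimal recording-rule sketch for TTFT and TPOT p95, assuming the serving pods are scraped and expose vLLM's Prometheus histograms (vllm:time_to_first_token_seconds, vllm:time_per_output_token_seconds). The rule and group names are illustrative.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llm-latency          # illustrative
spec:
  groups:
    - name: llm.latency
      rules:
        # TTFT p95 across all serving pods, 5m window
        - record: job:ttft_seconds:p95
          expr: |
            histogram_quantile(0.95,
              sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))
        # TPOT p95: time per output token while decoding
        - record: job:tpot_seconds:p95
          expr: |
            histogram_quantile(0.95,
              sum by (le) (rate(vllm:time_per_output_token_seconds_bucket[5m])))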
What it looks like

Ten card types. Each does one thing well.

Memory
PagedAttention

Virtual memory for the GPU KV cache. Pages it into non-contiguous blocks so many sequences share one GPU.

Compare
Tensor vs. pipeline parallel
  • Comms: NVLink-heavy
  • Span: Single-node
  • Latency: Lower TTFT
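In practice the tensor-parallel column above is a single flag. A container-spec fragment, assuming vLLM on a 4-GPU node; the model name is illustrative.

# Tensor parallel: shard every layer across 4 NVLinked GPUs in one node.
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - "--model"
      - "meta-llama/Llama-2-13b-chat-hf"   # illustrative model
      - "--tensor-parallel-size"
      - "4"   # NVLink-heavy comms, single-node span, lower TTFT
      # For spans beyond one node, vLLM also accepts --pipeline-parallel-size.
    resources:
      limits:
        nvidia.com/gpu: 4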
Scenario
A burst doubles in 30s

You have CPU-based HPA, no warm pool, 4-minute cold start. What gives first?

TTFT goes
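One defensible fix: stop scaling on CPU, scale on request concurrency, and keep a warm pool so the 4-minute cold start never sits on the request path. A minimal Knative sketch; the service name and target value are illustrative.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: chat-13b   # illustrative
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/target: "4"      # in-flight requests per pod
        autoscaling.knative.dev/min-scale: "2"   # warm pool: never scale to zero
    spec:
      containers:
        - image: vllm/vllm-openai:latest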
Architecture playground

Compose a real LLM platform from primitives.

Pick model server, autoscaler, GPU strategy, routing, and observability. We'll show where your design holds, where it leaks, and what the book's author would push back on.

Open the playground
Bursty LLM inference endpoint
Serve a 13B chat model behind a public API. Traffic doubles in 30s twice a day. Cold start is 4 minutes.
Multi-tenant fine-tuned serving
20 tenants, each with their own LoRA adapter on the same 7B base (see the sketch after these scenarios).
GPU sharing for a mixed workload
A shared GPU cluster supports both production inference and best-effort tuning experiments.
Detecting a quality regression
You are about to roll out a new fine-tune. Infra metrics are green.
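For the multi-tenant scenario above, vLLM can serve many adapters on one base model. A container-args sketch; adapter names and paths are illustrative.

# One 7B base, many resident LoRA adapters.
args:
  - "--model"
  - "meta-llama/Llama-2-7b-hf"        # the shared base
  - "--enable-lora"
  - "--max-loras"
  - "4"                               # adapters resident on the GPU at once
  - "--lora-modules"
  - "tenant-a=/adapters/tenant-a"     # illustrative tenant adapters
  - "tenant-b=/adapters/tenant-b"

A request then selects its adapter by sending the adapter name as the model field, so twenty tenants share one set of base weights.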
Layers: InferenceService (your intent: model: my-llama, runtime: vllm-h100, scaling: knative) → ServingRuntime (the engine template: image: vllm/vllm-openai, args, accelerators) → Deployment + Service (rendered pods) → Storage initializer (pulls weights) → Scaler (Knative or HPA).

Why this is good: one template, many models · promote new models without rebuilding the runtime · declarative, GitOps-friendly · the control loop ships pods so you don't have to.
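
The same layers as manifests: a minimal sketch using the names from the diagram; the registry path is illustrative.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime            # the engine template, owned by the platform team
metadata:
  name: vllm-h100
spec:
  supportedModelFormats:
    - name: huggingface
  containers:
    - name: kserve-container
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/gpu: 1
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService          # your intent, owned by the model team
metadata:
  name: my-llama
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      runtime: vllm-h100
      # Modelcar-style weights: pulled as an OCI image, not baked into the runtime.
      storageUri: oci://registry.example.com/models/my-llama:1.0   # illustrative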
Concept library

Twenty-two ideas, one diagram each.

Open the library
  • architecture
    Kubernetes control plane

    API server, scheduler, controllers — the brains.

  • gpu
    Pod scheduling for AI

    Resource requests meet node capacity, with GPU labels.

  • model-serving
    Model server

    Queue → batcher → GPU executor → token stream.

  • model-serving
    PagedAttention

    Virtual memory for GPU KV cache.

  • model-serving
    KServe stack

    InferenceService over ServingRuntime.

  • model-serving
    Modelcars

    Weights as a sidecar OCI image.

The point

You shouldn't have to read a 400-page book to ship a model.

You should be able to read a card. Then a diagram. Then make a defensible call. That's what this is for.

Start now
Distilled from public material on running generative AI workloads on Kubernetes.
maazghani/genai-on-k8s.dev

Content adapted from Generative AI on Kubernetes by Roland Huß. This visual guide is made freely available through Red Hat's sponsorship of the digital edition.