
Scaling, routing, and disaggregated serving

You will be able to design an autoscaling, cache-aware, cost-aware inference plane that survives bursty traffic.

  1. Autoscaling LLM inference
     Why CPU-based HPA is the wrong answer
  2. LLM-aware routing and the AI gateway
     Round-robin is malpractice when the KV cache is involved (see the sketch after this list)
  3. Disaggregated serving: prefill vs. decode
     Two phases with two GPU appetites: stop running them on the same hardware
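As a taste of lesson 02's claim, here is a minimal sketch, under assumed names, of why routing policy matters for KV-cache reuse: requests that share a prompt prefix should land on the same replica so its prefix cache stays warm. The replica list, hashing prefix length, and function names are illustrative, not any real gateway's API.

```python
import hashlib

REPLICAS = ["replica-0", "replica-1", "replica-2"]

def route_round_robin(i: int) -> str:
    # Ignores the prompt entirely: identical prefixes scatter across replicas,
    # so every replica redoes the same prefill work.
    return REPLICAS[i % len(REPLICAS)]

def route_prefix_affinity(prompt: str, prefix_chars: int = 40) -> str:
    # Hash only the shared leading prefix (e.g. the system prompt) so requests
    # with the same prefix land on the same replica and hit its warm KV cache.
    digest = hashlib.sha256(prompt[:prefix_chars].encode()).digest()
    return REPLICAS[int.from_bytes(digest[:4], "big") % len(REPLICAS)]

if __name__ == "__main__":
    system_prompt = "You are a helpful assistant for ACME support."
    prompts = [f"{system_prompt}\nUser question {i}" for i in range(6)]
    for i, prompt in enumerate(prompts):
        print(f"req {i}: round_robin -> {route_round_robin(i):<9} "
              f"prefix_affinity -> {route_prefix_affinity(prompt)}")
```

Production routers typically match on token blocks rather than raw characters and weigh cache affinity against replica load, but the core idea, keeping shared prefixes on the same replica, is what the lesson develops.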
Lock it in
Bursty LLM inference endpoint

Serve a 13B chat model behind a public API. Traffic doubles within 30 seconds, twice a day. A replica cold start takes 4 minutes.
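Before trying it, a back-of-the-envelope sketch of the core tension. The baseline load and per-replica throughput below are illustrative assumptions, not part of the scenario spec; only the 30-second ramp and 4-minute cold start come from it.

```python
# Burst vs. cold-start mismatch, with assumed (not given) capacity numbers.
BURST_RAMP_S = 30      # traffic goes from 1x to 2x over this window
COLD_START_S = 240     # a new replica needs 4 minutes to become ready
BASELINE_RPS = 50      # assumed steady-state request rate (illustrative)
PER_REPLICA_RPS = 10   # assumed sustainable throughput per replica (illustrative)

replicas_now = BASELINE_RPS // PER_REPLICA_RPS          # 5 replicas at baseline
replicas_needed = 2 * BASELINE_RPS // PER_REPLICA_RPS   # 10 replicas at peak

# Even a scaler that reacts the instant the ramp starts delivers the extra
# replicas COLD_START_S seconds later, long after the 30 s ramp has finished.
exposure_s = COLD_START_S - BURST_RAMP_S
print(f"need {replicas_needed - replicas_now} more replicas at peak, "
      f"but serve ~{exposure_s} s of 2x traffic on baseline capacity first")
```

Under these assumptions, reactive scaling alone cannot close the gap; either pre-warmed headroom or schedule-based scaling ahead of the two known daily bursts is what keeps the cold start off the critical path.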

Try the scenario