Scaling, routing, and disaggregated serving
You will be able to design an autoscaling, cache-aware, cost-aware inference plane that survives bursty traffic.
- 01 · Autoscaling LLM inference · Why CPU-based HPA is the wrong answer · 4 min
- 02 · LLM-aware routing and the AI gateway · Round-robin is malpractice when KV cache is involved (sketched in code below) · 4 min
- 03 · Disaggregated serving: prefill vs. decode · Two phases with two GPU appetites; stop running them on the same hardware · 3 min
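The claim in lesson 02 is easiest to see in code. Below is a minimal sketch of KV-cache-aware routing, assuming each replica reports an in-flight request count and that requests sharing a prompt prefix are cheapest to serve on the replica that already holds that prefix in its KV cache. The `Replica` and `CacheAwareRouter` names, the capacity of 32, and the 256-character prefix window are illustrative assumptions, not any particular gateway's API.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    in_flight: int = 0   # requests currently being served
    capacity: int = 32   # assumed concurrency limit before we spill over


class CacheAwareRouter:
    """Prefer the replica whose KV cache likely holds the prompt's prefix;
    fall back to the least-loaded replica when that one is saturated."""

    def __init__(self, replicas: list[Replica], prefix_chars: int = 256):
        self.replicas = replicas
        self.prefix_chars = prefix_chars

    def _prefix_key(self, prompt: str) -> int:
        # Hash only the leading characters (system prompt, few-shot examples):
        # the part most likely to be shared across requests and therefore
        # reusable from a replica's KV cache.
        digest = hashlib.sha256(prompt[: self.prefix_chars].encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def route(self, prompt: str) -> Replica:
        preferred = self.replicas[self._prefix_key(prompt) % len(self.replicas)]
        if preferred.in_flight < preferred.capacity:
            target = preferred                                      # sticky-by-prefix path
        else:
            target = min(self.replicas, key=lambda r: r.in_flight)  # spill to least loaded
        target.in_flight += 1
        return target


# Two requests that share a long system prompt land on the same replica,
# which is exactly the locality that round-robin throws away.
router = CacheAwareRouter([Replica("gpu-0"), Replica("gpu-1"), Replica("gpu-2")])
system = "You are a helpful assistant. Answer using the project glossary.\n" * 5
print(router.route(system + "Summarize this document.").name)
print(router.route(system + "Translate this document.").name)  # same replica
```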
Lock it in
Bursty LLM inference endpoint
Serve a 13B chat model behind a public API. Traffic doubles within 30 seconds twice a day; cold start is 4 minutes. (See the capacity sketch below.)
Try the scenario
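Before you try it, a back-of-the-envelope calculation shows why purely reactive scaling loses this scenario. Only the 4-minute cold start and the twice-daily doubling come from the scenario; the baseline of 100 requests/s and 10 requests/s per replica below are illustrative assumptions.

```python
# Scenario figure: cold start is 4 minutes. The 30-second burst ramp is short
# enough relative to that to ignore in this estimate.
COLD_START_S = 240

# Illustrative assumptions (not part of the scenario)
BASELINE_RPS = 100        # steady-state request rate
RPS_PER_REPLICA = 10      # sustainable throughput of one replica

peak_rps = 2 * BASELINE_RPS                        # the twice-daily doubling
steady_replicas = BASELINE_RPS // RPS_PER_REPLICA  # what you run off-peak
peak_replicas = peak_rps // RPS_PER_REPLICA        # what the burst needs

# A reactive autoscaler only notices the burst once it has begun, so the extra
# replicas arrive roughly one cold start later; until then the excess demand
# has nowhere to go.
excess_rps = peak_rps - steady_replicas * RPS_PER_REPLICA
backlog = excess_rps * COLD_START_S                # requests queued or shed

print(f"replicas: {steady_replicas} steady, {peak_replicas} at peak")
print(f"unserved demand per burst: {backlog} requests")
# With these assumptions, ~24,000 requests pile up during every burst. Closing
# that gap means scaling on a leading signal (queue depth, scheduled
# pre-warming) and keeping standing headroom, not waiting for CPU-based HPA.
```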