
Scaling, routing, and disaggregated serving

You will be able to design an autoscaling, cache-aware, cost-aware inference plane that survives bursty traffic.

  1. Autoscaling LLM inference
     Why CPU-based HPA is the wrong answer
  2. LLM-aware routing and the AI gateway
     Round-robin is malpractice when the KV cache is involved (see the sketch after this list)
  3. Disaggregated serving: prefill vs. decode
     Two phases with two GPU appetites: stop running them on the same hardware
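As a taste of lesson 02's claim, here is a minimal sketch, under assumed names, of why routing policy matters for KV-cache reuse: requests that share a prompt prefix should land on the same replica so its prefix cache stays warm. The replica list, hashing prefix length, and function names are illustrative, not any real gateway's API.

```python
import hashlib

REPLICAS = ["replica-0", "replica-1", "replica-2"]

def route_round_robin(i: int) -> str:
    # Ignores the prompt entirely: identical prefixes scatter across replicas,
    # so every replica redoes the same prefill work.
    return REPLICAS[i % len(REPLICAS)]

def route_prefix_affinity(prompt: str, prefix_chars: int = 40) -> str:
    # Hash only the shared leading prefix (e.g. the system prompt) so requests
    # with the same prefix land on the same replica and hit its warm KV cache.
    digest = hashlib.sha256(prompt[:prefix_chars].encode()).digest()
    return REPLICAS[int.from_bytes(digest[:4], "big") % len(REPLICAS)]

if __name__ == "__main__":
    system_prompt = "You are a helpful assistant for ACME support."
    prompts = [f"{system_prompt}\nUser question {i}" for i in range(6)]
    for i, prompt in enumerate(prompts):
        print(f"req {i}: round_robin -> {route_round_robin(i):<9} "
              f"prefix_affinity -> {route_prefix_affinity(prompt)}")
```

Production routers typically match on token blocks rather than raw characters and weigh cache affinity against replica load, but the core idea, keeping shared prefixes on the same replica, is what the lesson develops.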
Lock it in
Bursty LLM inference endpoint

Serve a 13B chat model behind a public API. Traffic doubles within 30 seconds, twice a day. A replica cold start takes 4 minutes.
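Before trying it, a back-of-the-envelope sketch of the core tension. The baseline load and per-replica throughput below are illustrative assumptions, not part of the scenario spec; only the 30-second ramp and 4-minute cold start come from it.

```python
# Burst vs. cold-start mismatch, with assumed (not given) capacity numbers.
BURST_RAMP_S = 30      # traffic goes from 1x to 2x over this window
COLD_START_S = 240     # a new replica needs 4 minutes to become ready
BASELINE_RPS = 50      # assumed steady-state request rate (illustrative)
PER_REPLICA_RPS = 10   # assumed sustainable throughput per replica (illustrative)

replicas_now = BASELINE_RPS // PER_REPLICA_RPS          # 5 replicas at baseline
replicas_needed = 2 * BASELINE_RPS // PER_REPLICA_RPS   # 10 replicas at peak

# Even a scaler that reacts the instant the ramp starts delivers the extra
# replicas COLD_START_S seconds later, long after the 30 s ramp has finished.
exposure_s = COLD_START_S - BURST_RAMP_S
print(f"need {replicas_needed - replicas_now} more replicas at peak, "
      f"but serve ~{exposure_s} s of 2x traffic on baseline capacity first")
```

Under these assumptions, reactive scaling alone cannot close the gap; either pre-warmed headroom or schedule-based scaling ahead of the two known daily bursts is what keeps the cold start off the critical path.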

Try the scenario