Scaling, routing, and disaggregated serving
Autoscaling LLM inference: Why CPU-based HPA is the wrong answer
Framing

GPU pods do not scale like web pods

Spinning up a new replica can take five minutes or more (image pull, weight load, warm-up). The autoscaler must therefore react before users feel pain, using signals that actually correlate with saturation, such as request queue depth or KV-cache utilization, rather than CPU usage, which stays low even while the GPU is maxed out.
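One way to sketch this: a Kubernetes `autoscaling/v2` HPA driven by a queue-depth metric instead of CPU. The metric name (`vllm_num_requests_waiting`, assumed to be exported through a Prometheus adapter), the deployment name, and all thresholds below are illustrative assumptions, not a prescribed setup.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server              # hypothetical deployment name
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_waiting   # queue depth per pod (assumed metric)
        target:
          type: AverageValue
          averageValue: "4"       # scale up well before the queue saturates
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # react immediately: cold start takes minutes
    scaleDown:
      stabilizationWindowSeconds: 600  # scale down slowly to avoid flapping
```

The asymmetric `behavior` block reflects the slow replica startup described above: scale-up fires on the first sign of queueing, while scale-down waits long enough that a brief lull does not tear down capacity that took minutes to warm.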