Scaling, routing, and disaggregated serving
Autoscaling LLM inference: Why CPU-based HPA is the wrong answer
Framing

GPU pods do not scale like web pods

Spinning up a new replica can take five minutes or more (image pull, weight load, warm-up). The autoscaler must therefore react before users feel pain, using signals that actually correlate with saturation, such as request queue depth or KV-cache utilization, rather than CPU usage, which stays low even while the GPU is maxed out.
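One way to sketch this: a Kubernetes `autoscaling/v2` HPA driven by a queue-depth metric instead of CPU. The metric name (`vllm_num_requests_waiting`, assumed to be exported through a Prometheus adapter), the deployment name, and all thresholds below are illustrative assumptions, not a prescribed setup.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server              # hypothetical deployment name
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_waiting   # queue depth per pod (assumed metric)
        target:
          type: AverageValue
          averageValue: "4"       # scale up well before the queue saturates
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # react immediately: cold start takes minutes
    scaleDown:
      stabilizationWindowSeconds: 600  # scale down slowly to avoid flapping
```

The asymmetric `behavior` block reflects the slow replica startup described above: scale-up fires on the first sign of queueing, while scale-down waits long enough that a brief lull does not tear down capacity that took minutes to warm.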