Generative AI on Kubernetes

The asymmetry

Prefill is compute-bound, decode is memory-bound

Prefill processes the whole prompt in one shot — it loves dense compute. Decode emits one token at a time and is bottlenecked by KV-cache bandwidth. Mixing them on one GPU means each phase fights the other.