Generative AI on Kubernetes

The tension

GPUs are scarce, so sharing matters

An H100 is too big for a small model and too small for a 70B model. The platform has to decide: hand out whole GPUs, slice them, or coordinate several. Each choice has a different fairness, isolation, and performance shape.