Generative AI on Kubernetes

Pattern

From Deployment to InferenceService

Hand-rolling a Deployment, Service, HPA, and PVC for every model duplicates effort and hides intent. A model server controller introduces a higher-level resource that says: 'serve this model with this runtime', and assembles the rest.