Model serving on Kubernetes
Anatomy of a model server
Why you almost never wrap PyTorch in Flask in production
Definition

A model server is a runtime, not a wrapper

A model server loads weights, schedules concurrent requests onto a GPU, batches them, manages KV cache, and exposes an OpenAI-style API. Wrapping a model in Flask gives you correctness; a real server gives you throughput.
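The batching step is the core of the throughput win. A minimal sketch of the idea, with illustrative names (`MAX_BATCH`, `fake_forward` stands in for a batched GPU forward pass; this is not the API of any real serving framework): requests queue up, and a worker drains as many as it can per step so one model call serves many clients.

```python
import queue
import threading

# Toy sketch of dynamic batching: requests accumulate in a queue and a
# single worker drains up to MAX_BATCH of them per step, amortizing one
# "forward pass" across all of them. MAX_BATCH and fake_forward are
# illustrative assumptions, not taken from a real server.
MAX_BATCH = 8
requests: queue.Queue = queue.Queue()

def fake_forward(prompts):
    # Stand-in for a batched model call; a real server runs this on the GPU.
    return [p.upper() for p in prompts]

def worker():
    while True:
        batch = [requests.get()]  # block until at least one request arrives
        while len(batch) < MAX_BATCH:
            try:
                batch.append(requests.get_nowait())  # opportunistically fill
            except queue.Empty:
                break
        outputs = fake_forward([prompt for prompt, _ in batch])
        for (_, reply_q), out in zip(batch, outputs):
            reply_q.put(out)  # hand each caller its own result

threading.Thread(target=worker, daemon=True).start()

def submit(prompt: str) -> str:
    # Per-request reply queue: the caller blocks until its answer is ready.
    reply_q: queue.Queue = queue.Queue()
    requests.put((prompt, reply_q))
    return reply_q.get()
```

A Flask handler answers one request per model call; this pattern lets concurrent requests share a call, which is where the throughput gap comes from. Real servers (e.g. vLLM) go further with continuous batching and paged KV-cache management.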