Model serving on Kubernetes
Anatomy of a model server
Why you almost never wrap PyTorch in Flask in production
Definition

A model server is a runtime, not a wrapper

A model server loads weights, schedules concurrent requests onto a GPU, batches them, manages KV cache, and exposes an OpenAI-style API. Wrapping a model in Flask gives you correctness; a real server gives you throughput.
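The batching step is the core of the throughput win. A minimal sketch of the idea, with illustrative names (`MAX_BATCH`, `fake_forward` stands in for a batched GPU forward pass; this is not the API of any real serving framework): requests queue up, and a worker drains as many as it can per step so one model call serves many clients.

```python
import queue
import threading

# Toy sketch of dynamic batching: requests accumulate in a queue and a
# single worker drains up to MAX_BATCH of them per step, amortizing one
# "forward pass" across all of them. MAX_BATCH and fake_forward are
# illustrative assumptions, not taken from a real server.
MAX_BATCH = 8
requests: queue.Queue = queue.Queue()

def fake_forward(prompts):
    # Stand-in for a batched model call; a real server runs this on the GPU.
    return [p.upper() for p in prompts]

def worker():
    while True:
        batch = [requests.get()]  # block until at least one request arrives
        while len(batch) < MAX_BATCH:
            try:
                batch.append(requests.get_nowait())  # opportunistically fill
            except queue.Empty:
                break
        outputs = fake_forward([prompt for prompt, _ in batch])
        for (_, reply_q), out in zip(batch, outputs):
            reply_q.put(out)  # hand each caller its own result

threading.Thread(target=worker, daemon=True).start()

def submit(prompt: str) -> str:
    # Per-request reply queue: the caller blocks until its answer is ready.
    reply_q: queue.Queue = queue.Queue()
    requests.put((prompt, reply_q))
    return reply_q.get()
```

A Flask handler answers one request per model call; this pattern lets concurrent requests share a call, which is where the throughput gap comes from. Real servers (e.g. vLLM) go further with continuous batching and paged KV-cache management.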