Scaling, routing, and disaggregated serving
LLM-aware routing and the AI gateway
Round-robin is malpractice when KV cache is involved
Why this differs

Identical replicas have non-identical state

Two LLM replicas may run the same container image, but each carries its own KV cache populated by the prompts it has recently served. Routing the next message in a conversation to a random replica throws that cached state away and forces prefill for the entire conversation history to be recomputed from scratch.
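One common way a gateway keeps a conversation on the replica whose KV cache is already warm is session affinity via a stable hash. The sketch below is a minimal illustration, not any particular gateway's implementation; the class name, replica labels, and `pick` method are all hypothetical.

```python
import hashlib


class PrefixAffinityRouter:
    """Hypothetical sketch: pin each conversation to one replica so its
    KV cache (built during earlier turns) can be reused on later turns."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def pick(self, conversation_id: str) -> str:
        # A stable hash of the conversation ID maps every turn of that
        # conversation to the same replica, instead of a random one.
        digest = hashlib.sha256(conversation_id.encode()).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.replicas)
        return self.replicas[index]


router = PrefixAffinityRouter(["replica-a", "replica-b", "replica-c"])
first = router.pick("conv-123")
# Every later turn of conv-123 lands on the same replica, so prefill
# for the shared prefix does not have to be recomputed elsewhere.
assert all(router.pick("conv-123") == first for _ in range(10))
```

A real gateway would layer health checks, load-aware fallback, and cache-hit telemetry on top of this; the point here is only that the routing key must be the conversation (or prompt prefix), not a round-robin counter.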