Scaling, routing, and disaggregated serving
Disaggregated serving: prefill vs. decode
Two phases with two GPU appetites: stop running them on the same hardware
The asymmetry

Prefill is compute-bound, decode is memory-bound

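Prefill processes the whole prompt in a single forward pass, so its large matrix multiplications keep the GPU's compute units busy. Decode emits one token at a time and has to reread the model weights and the growing KV cache on every step, so it is bottlenecked by memory bandwidth (KV-cache bandwidth in particular) rather than by compute. Mixing them on one GPU means each phase fights the other: a long prefill stalls in-flight decode steps, and decode steps leave compute idle.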
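A rough roofline-style sketch can make the asymmetry concrete. It estimates the arithmetic intensity (FLOPs per byte moved) of one square linear layer for a prefill-sized batch of tokens and for a single decode token. The GPU figures, the layer width, and the function names are illustrative assumptions, not measurements, and the model ignores attention and KV-cache traffic, which would push decode even further toward memory-bound.

```python
# Back-of-the-envelope roofline check for one linear layer.
# All hardware numbers below are hypothetical, chosen only for illustration.

PEAK_FLOPS = 1.0e15      # assumed peak compute, FLOP/s
MEM_BANDWIDTH = 3.0e12   # assumed memory bandwidth, bytes/s
RIDGE = PEAK_FLOPS / MEM_BANDWIDTH  # FLOPs per byte needed to saturate compute


def arithmetic_intensity(tokens: int, d_in: int, d_out: int,
                         bytes_per_value: int = 2) -> float:
    """FLOPs per byte for a (tokens x d_in) @ (d_in x d_out) matmul."""
    flops = 2 * tokens * d_in * d_out                 # multiply + add
    weight_bytes = d_in * d_out * bytes_per_value     # weights read once
    activation_bytes = tokens * (d_in + d_out) * bytes_per_value  # in + out
    return flops / (weight_bytes + activation_bytes)


def classify(tokens: int, d_model: int = 4096) -> str:
    """Label the layer as compute- or memory-bound on the assumed GPU."""
    ai = arithmetic_intensity(tokens, d_model, d_model)
    return "compute-bound" if ai >= RIDGE else "memory-bound"


if __name__ == "__main__":
    for phase, tokens in [("prefill (2048-token prompt)", 2048),
                          ("decode (1 token)", 1)]:
        ai = arithmetic_intensity(tokens, 4096, 4096)
        print(f"{phase}: {ai:.1f} FLOP/byte, ridge {RIDGE:.0f} -> {classify(tokens)}")
```

Under these assumptions the prefill pass sits well above the ridge point while the single-token decode step sits far below it, which is the gap that motivates putting each phase on hardware sized for it.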