Architecture playground
Bursty LLM inference endpoint
Serve a 13B chat model behind a public API. Traffic doubles within 30 seconds, twice a day. A new replica takes 4 minutes to cold-start.
Goal
Keep time-to-first-token (TTFT) p95 under 1.5s during bursts; control cost during quiet hours.
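A minimal sketch of how the p95 target could be checked against request samples. The function name `percentile` and the sample values are illustrative, not part of the scenario.

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile over latency samples (seconds)."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, math.ceil(pct / 100 * len(ordered)) - 1))
    return ordered[rank]

# Hypothetical TTFT measurements (seconds) collected during a burst window.
ttft_samples = [0.4, 0.6, 0.9, 1.1, 1.2, 1.3, 1.4, 1.6, 0.8, 0.7]
p95 = percentile(ttft_samples, 95)
print(f"TTFT p95: {p95:.2f}s, target met: {p95 < 1.5}")
```

In practice this would run over a rolling window (e.g. per minute) so the burst p95 is visible separately from the quiet-hour p95.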
Constraints
- Single region
- Limited GPU budget: 6 H100s
- OpenAI-compatible API required
Compose your reference architecture
Component categories:
- Serving
- Scaling
- Data
- Routing
- Observability
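One possible composition across the five categories, expressed as plain data. This is purely illustrative — the component choices are assumptions, not a prescribed answer to the exercise:

```python
# Hypothetical reference architecture; every choice below is an assumption.
reference_architecture = {
    "Serving": "OpenAI-compatible inference server with continuous batching",
    "Scaling": "scheduled pre-warming ahead of the two daily bursts, "
               "plus a small always-warm floor (cold start is 4 min)",
    "Data": "KV/prefix cache reuse for shared chat-prompt prefixes",
    "Routing": "least-outstanding-requests load balancing across replicas",
    "Observability": "per-request TTFT histogram tracking the 1.5s p95 target",
}

for category, choice in reference_architecture.items():
    print(f"{category}: {choice}")
```

The scaling choice is the load-bearing one here: with bursts at predictable times and a cold start eight times longer than the doubling time, scheduled capacity is what protects TTFT, while the quiet-hour floor controls cost.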