Bursty LLM inference endpoint

Serve a 13B chat model behind a public API. Twice a day, traffic doubles within 30 seconds. Cold start for a new replica is 4 minutes.
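
A 4-minute cold start cannot react to a burst that lands in 30 seconds: a replica started at burst onset is ready roughly 210 seconds after the spike has fully arrived, so peak capacity has to be warm in advance. A back-of-envelope sizing sketch follows; the baseline load and per-replica throughput are hypothetical, not given by the scenario.

    import math

    COLD_START_S = 240     # from the scenario: 4-minute cold start
    BURST_WINDOW_S = 30    # traffic doubles this fast
    BASELINE_RPS = 40      # hypothetical quiet-hours request rate
    PER_REPLICA_RPS = 15   # hypothetical throughput of one 13B replica

    # A replica launched when the burst starts arrives this late, so the
    # fleet must already hold peak capacity before the spike hits.
    lag_s = COLD_START_S - BURST_WINDOW_S
    peak_rps = BASELINE_RPS * 2                             # burst doubles traffic
    warm_replicas = math.ceil(peak_rps / PER_REPLICA_RPS)   # capacity to keep warm
    print(f"scale-up lag: {lag_s}s; replicas to keep warm: {warm_replicas}")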

Goal

Keep time-to-first-token (TTFT) p95 under 1.5 s during bursts and control cost during quiet hours.
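
TTFT is the delay from sending a request to receiving the first streamed token, so it can be probed from outside with any OpenAI-compatible client. A minimal sketch using the openai Python SDK; the base URL, API key, and model name are placeholders for whatever the cluster exposes.

    import time
    from openai import OpenAI  # pip install openai

    # Point the SDK at the cluster's endpoint; URL, key, and model are placeholders.
    client = OpenAI(base_url="http://llm.example.internal/v1", api_key="unused")

    def ttft_seconds() -> float:
        """Seconds from sending the request to receiving the first token."""
        start = time.perf_counter()
        stream = client.chat.completions.create(
            model="chat-13b",
            messages=[{"role": "user", "content": "Say hi."}],
            stream=True,
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                return time.perf_counter() - start
        return float("inf")

    samples = sorted(ttft_seconds() for _ in range(100))
    print(f"TTFT p95: {samples[94]:.2f}s (target: < 1.5s)")  # 95th of 100 sorted samples

Run it during a burst and during quiet hours separately; a p95 that only breaches during the spike points at cold starts rather than steady-state capacity.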

Constraints
  • Single region
  • Limited GPU budget: 6 H100s (see the fit check after this list)
  • OpenAI-compatible API required
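
The GPU-budget question is whether each H100 hosts a full replica or the model must shard across cards. In fp16, weights cost about 2 bytes per parameter, so a 13B model needs roughly 26 GB and fits comfortably in an H100's 80 GB. A rough fit check; the runtime-overhead figure is an assumption.

    PARAMS = 13e9            # 13B parameters
    BYTES_PER_PARAM = 2      # fp16/bf16 weights
    H100_GB = 80             # HBM per H100
    RUNTIME_GB = 6           # assumed CUDA context + runtime overhead

    weights_gb = PARAMS * BYTES_PER_PARAM / 1e9      # ~26 GB of weights
    kv_cache_gb = H100_GB - weights_gb - RUNTIME_GB  # ~48 GB left for KV cache
    print(f"weights ~{weights_gb:.0f} GB, KV-cache headroom ~{kv_cache_gb:.0f} GB per GPU")
    # One replica fits per GPU, so the budget buys up to six independent
    # replicas rather than forcing tensor parallelism.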

Compose your reference architecture

Component categories:
  • Serving
  • Scaling (see the pre-warm sketch after this list)
  • Data
  • Routing
  • Observability
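
Under Scaling, the twice-daily spikes favor schedule-driven pre-warming over purely reactive autoscaling, since no metric-triggered scale-up can outrun a 4-minute cold start. A minimal sketch with the official Kubernetes Python client, assuming a Deployment named llm-server in namespace inference (both hypothetical); in practice a CronJob or a KEDA cron trigger would run this more than 4 minutes before each known spike.

    from kubernetes import client, config  # pip install kubernetes

    def prewarm(replicas: int) -> None:
        """Scale the serving Deployment ahead of a known burst window."""
        config.load_incluster_config()  # use load_kube_config() outside the cluster
        apps = client.AppsV1Api()
        apps.patch_namespaced_deployment_scale(
            name="llm-server",       # hypothetical Deployment name
            namespace="inference",   # hypothetical namespace
            body={"spec": {"replicas": replicas}},
        )

    if __name__ == "__main__":
        prewarm(replicas=6)  # commit the full GPU budget for the burst window

After the burst passes, the same call (or an HPA with a low minReplicas) scales back down, which covers the quiet-hours half of the goal.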