Bursty LLM inference endpoint

Serve a 13B chat model behind a public API. Twice a day, traffic doubles within 30 seconds. Cold start for a new replica is 4 minutes.
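
A 4-minute cold start cannot react to a burst that lands in 30 seconds: a replica started at burst onset is ready roughly 210 seconds after the spike has fully arrived, so peak capacity has to be warm in advance. A back-of-envelope sizing sketch follows; the baseline load and per-replica throughput are hypothetical, not given by the scenario.

    import math

    COLD_START_S = 240     # from the scenario: 4-minute cold start
    BURST_WINDOW_S = 30    # traffic doubles this fast
    BASELINE_RPS = 40      # hypothetical quiet-hours request rate
    PER_REPLICA_RPS = 15   # hypothetical throughput of one 13B replica

    # A replica launched when the burst starts arrives this late, so the
    # fleet must already hold peak capacity before the spike hits.
    lag_s = COLD_START_S - BURST_WINDOW_S
    peak_rps = BASELINE_RPS * 2                             # burst doubles traffic
    warm_replicas = math.ceil(peak_rps / PER_REPLICA_RPS)   # capacity to keep warm
    print(f"scale-up lag: {lag_s}s; replicas to keep warm: {warm_replicas}")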

Goal

Keep time-to-first-token (TTFT) p95 under 1.5 s during bursts and control cost during quiet hours.
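
TTFT is the delay from sending a request to receiving the first streamed token, so it can be probed from outside with any OpenAI-compatible client. A minimal sketch using the openai Python SDK; the base URL, API key, and model name are placeholders for whatever the cluster exposes.

    import time
    from openai import OpenAI  # pip install openai

    # Point the SDK at the cluster's endpoint; URL, key, and model are placeholders.
    client = OpenAI(base_url="http://llm.example.internal/v1", api_key="unused")

    def ttft_seconds() -> float:
        """Seconds from sending the request to receiving the first token."""
        start = time.perf_counter()
        stream = client.chat.completions.create(
            model="chat-13b",
            messages=[{"role": "user", "content": "Say hi."}],
            stream=True,
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                return time.perf_counter() - start
        return float("inf")

    samples = sorted(ttft_seconds() for _ in range(100))
    print(f"TTFT p95: {samples[94]:.2f}s (target: < 1.5s)")  # 95th of 100 sorted samples

Run it during a burst and during quiet hours separately; a p95 that only breaches during the spike points at cold starts rather than steady-state capacity.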

Constraints
  • Single region
  • Limited GPU budget: 6 H100s (see the fit check after this list)
  • OpenAI-compatible API required
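
The GPU-budget question is whether each H100 hosts a full replica or the model must shard across cards. In fp16, weights cost about 2 bytes per parameter, so a 13B model needs roughly 26 GB and fits comfortably in an H100's 80 GB. A rough fit check; the runtime-overhead figure is an assumption.

    PARAMS = 13e9            # 13B parameters
    BYTES_PER_PARAM = 2      # fp16/bf16 weights
    H100_GB = 80             # HBM per H100
    RUNTIME_GB = 6           # assumed CUDA context + runtime overhead

    weights_gb = PARAMS * BYTES_PER_PARAM / 1e9      # ~26 GB of weights
    kv_cache_gb = H100_GB - weights_gb - RUNTIME_GB  # ~48 GB left for KV cache
    print(f"weights ~{weights_gb:.0f} GB, KV-cache headroom ~{kv_cache_gb:.0f} GB per GPU")
    # One replica fits per GPU, so the budget buys up to six independent
    # replicas rather than forcing tensor parallelism.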

Compose your reference architecture

Component categories:
  • Serving
  • Scaling (see the pre-warm sketch after this list)
  • Data
  • Routing
  • Observability
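
Under Scaling, the twice-daily spikes favor schedule-driven pre-warming over purely reactive autoscaling, since no metric-triggered scale-up can outrun a 4-minute cold start. A minimal sketch with the official Kubernetes Python client, assuming a Deployment named llm-server in namespace inference (both hypothetical); in practice a CronJob or a KEDA cron trigger would run this more than 4 minutes before each known spike.

    from kubernetes import client, config  # pip install kubernetes

    def prewarm(replicas: int) -> None:
        """Scale the serving Deployment ahead of a known burst window."""
        config.load_incluster_config()  # use load_kube_config() outside the cluster
        apps = client.AppsV1Api()
        apps.patch_namespaced_deployment_scale(
            name="llm-server",       # hypothetical Deployment name
            namespace="inference",   # hypothetical namespace
            body={"spec": {"replicas": replicas}},
        )

    if __name__ == "__main__":
        prewarm(replicas=6)  # commit the full GPU budget for the burst window

After the burst passes, the same call (or an HPA with a low minReplicas) scales back down, which covers the quiet-hours half of the goal.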