The fastest way to look something up — or remember why you cared in the first place.
Control plane: API server, scheduler, controllers — the brains.
Scheduling: resource requests meet node capacity, with GPU node labels.
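Conceptually, the scheduler's filter step is one predicate: do the pod's resource requests fit the node's free capacity, and do the node's labels satisfy the selector? A toy version (the real kube-scheduler runs many more plugins; the dict shapes here are illustrative, not the Kubernetes API):

```python
def fits(pod, node):
    """pod: {"requests": {...}, "selector": {...}};
    node: {"free": {...}, "labels": {...}}."""
    for resource, amount in pod["requests"].items():
        if node["free"].get(resource, 0) < amount:
            return False              # not enough CPU / memory / GPU left
    for key, value in pod["selector"].items():
        if node["labels"].get(key) != value:
            return False              # e.g. wrong GPU type label
    return True
```

Scoring (which of the surviving nodes is *best*) happens after this filter; a node that fails either check is simply out of the running.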
Inference engine: request queue → continuous batcher → GPU executor → token stream.
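That queue → batcher → executor path is the continuous-batching loop at the heart of engines like vLLM: slots are refilled from the queue on every decode step instead of waiting for a whole batch to drain. A toy sketch (the one-token-per-step "executor" and all names are illustrative, not any real engine's API):

```python
from collections import deque

def serve(requests, max_batch=4):
    """Toy continuous batching: refill batch slots from the queue every
    step, run one decode step for the whole batch, and retire finished
    requests immediately so short jobs never wait on long ones."""
    queue = deque(requests)                 # (req_id, tokens_to_generate)
    running = {}                            # req_id -> tokens remaining
    streams = {rid: [] for rid, _ in requests}
    step = 0
    while queue or running:
        while queue and len(running) < max_batch:   # admit into free slots
            rid, n = queue.popleft()
            running[rid] = n
        for rid in list(running):           # one fused "GPU step"
            streams[rid].append(f"tok{step}")
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]            # slot is reusable next step
        step += 1
    return streams, step
```

With static batching, a finished request's slot would sit idle until the slowest member of the batch completed; here it is handed to the next queued request on the very next step.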
PagedAttention: virtual memory for the GPU KV cache.
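The "virtual memory" framing (popularized as PagedAttention) maps each sequence's KV cache onto fixed-size blocks through a per-sequence block table, so memory is allocated on demand rather than reserved at maximum length. A minimal allocator sketch, illustrative only and nothing like vLLM's actual implementation:

```python
class PagedKVCache:
    """Fixed-size blocks plus per-sequence block tables, like OS paging."""
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free-block list
        self.tables = {}                      # seq_id -> [block ids]
        self.lengths = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block is full
            if not self.free:
                raise MemoryError("KV cache exhausted: preempt a sequence")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):                # sequence finished
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Waste is capped at one partially filled block per sequence, instead of the gap between a sequence's actual length and the pre-reserved maximum.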
KServe: an InferenceService over a ServingRuntime.
Modelcar: weights shipped as a sidecar OCI image.
Node pools: one pool per hardware shape.
MIG: slice one GPU into hardware-isolated instances.
Tensor parallelism: split each layer across GPUs.
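Tensor parallelism in one picture: a layer's weight matrix is split column-wise across devices, each device multiplies its shard independently, and the partial outputs are concatenated (the all-gather the technique pays for on every layer). A pure-Python sketch simulating the shards, with a tiny list-of-lists matmul standing in for a GPU kernel:

```python
def matmul(A, B):
    """Tiny dense matmul on lists of rows (stand-in for a GPU kernel)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def split_cols(W, shards):
    """Column-split W into `shards` pieces, one per device."""
    n = len(W[0]) // shards
    return [[row[i * n:(i + 1) * n] for row in W] for i in range(shards)]

def column_parallel_linear(x, W, shards=2):
    # Each "device" multiplies against its own column shard; concatenating
    # the partial outputs row-by-row is the all-gather step.
    partials = [matmul(x, w) for w in split_cols(W, shards)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]
```

The result is identical to the unsharded matmul, which is the whole point: the split changes where the FLOPs run, not what is computed.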
Pipeline parallelism: split consecutive layers across GPUs as a pipeline of stages.
Autoscaling: demand spikes faster than new replicas can cold-start.
KV-cache-aware routing: sticky to the replica that already holds your KV cache.
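Cache-aware routing in miniature: hash the longest already-seen prefix of the prompt and send the request to the replica that served it, so prefill can hit warm KV cache. A sketch of the idea only, not any real gateway's algorithm; the class and its round-robin fallback are made up for illustration:

```python
import hashlib

class CacheAwareRouter:
    """Route follow-up turns to the replica whose KV cache is warm."""
    def __init__(self, replicas):
        self.replicas = replicas
        self.prefix_owner = {}        # prefix digest -> replica
        self.rr = 0                   # round-robin counter for cold prompts

    def route(self, prompt):
        # Check progressively shorter prefixes for a warm replica.
        for cut in range(len(prompt), 0, -1):
            digest = hashlib.sha256(prompt[:cut].encode()).hexdigest()
            if digest in self.prefix_owner:
                replica = self.prefix_owner[digest]
                break
        else:                         # cold prompt: round-robin
            replica = self.replicas[self.rr % len(self.replicas)]
            self.rr += 1
        # Remember this prompt as a warm prefix for the next turn.
        self.prefix_owner[hashlib.sha256(prompt.encode()).hexdigest()] = replica
        return replica
```

Real routers hash fixed-size token blocks rather than every character prefix, but the trade-off is the same: affinity saves prefill, at the cost of less even load.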
Disaggregated serving: prefill and decode on different workers.
Observability: OpenTelemetry metrics into Prometheus, plus logs and traces.
TTFT vs. ITL: the wait for the first token vs. the per-token streaming cadence after it.
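These are the two latency numbers users actually feel: time to first token (dominated by prefill) and the inter-token gap once streaming starts (dominated by decode). A sketch of measuring both from a token iterator; the generator is a stand-in for a real streaming client, and the sleep values are arbitrary:

```python
import time

def measure_stream(token_stream):
    """Return (ttft, mean inter-token latency) for a token iterator."""
    start = time.perf_counter()
    ttft = None
    prev = start
    gaps = []
    for _ in token_stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start         # prefill cost lands here
        else:
            gaps.append(now - prev)    # steady-state decode cadence
        prev = now
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

def fake_stream(prefill_s=0.05, decode_s=0.01, n=5):
    time.sleep(prefill_s)              # simulated prefill
    for i in range(n):
        yield f"tok{i}"
        time.sleep(decode_s)           # simulated per-token decode
```

Averaging throughput over a whole request hides exactly this split, which is why serving dashboards report the two numbers separately.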
LoRA: tiny low-rank weight updates.
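LoRA's whole trick in a few lines: freeze the base weight W and learn a rank-r update ΔW = B·A, dropping the trainable parameter count from d·k to r·(d+k). A pure-Python forward pass with toy dimensions (the function name and scaling convention follow the common alpha/r form, but this is a sketch, not any library's API):

```python
def lora_forward(x, W, A, B, alpha=1.0):
    """y = x(W + (alpha/r) * B A): frozen base plus a rank-r update.
    W is d x k, B is d x r, A is r x k; x is a single row vector."""
    r = len(A)
    # x @ W: the frozen, full-rank path.
    base = [sum(x[i] * W[i][j] for i in range(len(W)))
            for j in range(len(W[0]))]
    # x @ B (down to r dims), then @ A (back up to k): the cheap path.
    xb = [sum(x[i] * B[i][t] for i in range(len(B))) for t in range(r)]
    delta = [sum(xb[t] * A[t][j] for t in range(r)) for j in range(len(A[0]))]
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]
```

Because only A and B carry gradients, dozens of adapters can share one frozen base model, and at serve time B·A can be merged into W with no inference overhead.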
Gang scheduling: all workers admit together, or none.
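All-or-nothing admission is easy to state: reserve capacity for every worker in the gang, and if any member doesn't fit, roll the whole reservation back. A toy admitter (schedulers like Volcano and Kueue implement the real thing; the data shapes here are illustrative):

```python
def gang_admit(nodes, gang):
    """nodes: {name: free_gpus}; gang: list of per-worker GPU needs.
    Place the whole gang, or leave `nodes` completely untouched."""
    placement, free = [], dict(nodes)     # work on a copy until success
    for need in gang:
        # First-fit: any node with enough free GPUs for this worker.
        node = next((n for n, f in free.items() if f >= need), None)
        if node is None:
            return None                   # one worker can't fit: admit nobody
        free[node] -= need
        placement.append(node)
    nodes.update(free)                    # commit only on full success
    return placement
```

Without this, two half-admitted multi-node jobs can each hold GPUs the other needs and deadlock, which is exactly what gang scheduling exists to prevent.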
Topology-aware placement: pack workers onto the fast interconnect.
RAG: retrieve, augment, generate.
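The three verbs, end to end: retrieve relevant chunks, augment the prompt with them, generate. A toy pipeline with keyword-overlap retrieval and a stubbed model call; every function here is illustrative (real systems use embedding similarity, not word overlap):

```python
def retrieve(query, corpus, k=2):
    """Rank chunks by naive keyword overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda c: -len(q & set(c.lower().split())))
    return ranked[:k]

def augment(query, chunks):
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def generate(prompt):
    return f"[model output for {len(prompt)}-char prompt]"  # stand-in

def rag(query, corpus):
    return generate(augment(query, retrieve(query, corpus)))
```

Each stage is independently swappable, which is why RAG stacks tend to be described as pipelines: better retriever, same augmenter, same model.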
Agent loop: plan, act, observe, decide.
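That four-beat cycle is the standard (ReAct-style) agent loop: the model plans a step, a tool runs, the observation is fed back, and a decision either finishes or continues. A sketch with a stubbed planner and one fake tool; the planner signature and the `lookup` tool are invented for illustration:

```python
def run_agent(goal, plan, tools, max_steps=5):
    """plan(goal, history) -> ("act", tool, arg) or ("finish", answer)."""
    history = []
    for _ in range(max_steps):
        decision = plan(goal, history)              # plan
        if decision[0] == "finish":                 # decide: done
            return decision[1], history
        _, tool, arg = decision
        observation = tools[tool](arg)              # act
        history.append((tool, arg, observation))    # observe
    return None, history                            # step budget exhausted

def plan(goal, history):        # illustrative planner: look up once, finish
    if not history:
        return ("act", "lookup", goal)
    return ("finish", history[-1][-1])

tools = {"lookup": lambda q: f"42 (looked up for: {q})"}
```

The `max_steps` budget matters as much as the loop itself: without it, a confused planner cycles forever, burning tokens and tool calls.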
Tool auth: per-tool identity, scoped delegation.
Failure domains: pod, node, zone, region — pick yours.
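Spreading replicas across failure domains bounds the blast radius of any single failure; the choice of domain key (node vs. zone vs. region) sets how big a failure you survive. A toy spread placer in the spirit of Kubernetes topology spread constraints (the zone names are made up):

```python
from collections import Counter

def spread(replicas, domains):
    """Assign each replica to the currently least-loaded failure domain,
    keeping the per-domain skew at most one replica."""
    load = Counter({d: 0 for d in domains})
    placement = []
    for _ in range(replicas):
        domain = min(load, key=load.get)   # least-loaded domain wins
        load[domain] += 1
        placement.append(domain)
    return placement
```

Losing any one domain then costs at most ceil(replicas / domains) replicas, instead of potentially all of them.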