AI-driven apps: RAG and agents
Agent state, checkpointing, and failure domains
How to keep long-running agents alive across pod restarts
Framing

Agents are stateful; the pod is not

An agent halfway through a 12-step workflow holds expensive state: tool results, partial plans, retrieved documents. If the pod dies, you cannot afford to restart from scratch, and you cannot afford to repeat the side effects of steps that already ran.
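One way to picture both constraints is a step loop that checkpoints after every step and hands each step a stable idempotency key. The sketch below is illustrative only: `run_workflow`, `load_checkpoint`, `save_checkpoint`, and the JSON file layout are hypothetical names, and it assumes the checkpoint path sits on storage that outlives the pod (a persistent volume, or swap the file for a database row).

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


def load_checkpoint(path: Path) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"next_step": 0, "results": []}


def save_checkpoint(path: Path, state: dict) -> None:
    """Write to a temp file, then rename, so a crash never leaves a torn file."""
    fd, tmp = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic replacement of the previous checkpoint


def run_workflow(run_id: str, steps: list[Callable[[str], str]], path: Path) -> list[str]:
    """Run steps in order, resuming from the last checkpointed step."""
    state = load_checkpoint(path)
    for i in range(state["next_step"], len(steps)):
        # The key is the same on every retry of this step, so a downstream
        # system can deduplicate if the step ran but the checkpoint did not.
        key = f"{run_id}:{i}"
        state["results"].append(steps[i](key))
        state["next_step"] = i + 1
        save_checkpoint(path, state)
    return state["results"]
```

After a restart, calling `run_workflow` again with the same `run_id` and path skips the completed steps. The step that was in flight when the pod died is retried with the same key, so avoiding a repeated side effect still depends on the receiving system honoring that key.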