Agent Lifecycle State Machines: 12 States of a Digital Employee

Managing a single AI agent is easy. Managing a fleet of them requires a lifecycle system that can handle every state an agent might be in, from initial creation through active operation to eventual teardown. The lifecycle manager in AgenticMail Enterprise models this as a 12-state finite state machine, and every transition is deliberate.

The 12 States

Here’s the full lifecycle:

Draft: Agent has been defined but not configured. Think of it as a job requisition that hasn’t been filled yet.
Configuring: Agent is being set up with its capabilities, credentials, and initial context.
Provisioning: External resources are being created (email accounts, calendar access, API keys, phone numbers).
Ready: Fully provisioned and waiting to be activated.
Starting: Boot sequence in progress. The agent’s LLM context is being initialized and tools are loading.
Active: Running and processing work. This is the steady state for a healthy agent.
Paused: Temporarily stopped. The agent retains all state but isn’t processing new work.
Degraded: Running but experiencing issues. Some capabilities may be unavailable.
Recovering: The system detected failures and is attempting auto recovery.
Suspended: Administratively stopped. Unlike paused, this requires explicit human action to resume.
Deprovisioning: External resources are being torn down. Email accounts closed, API keys revoked.
Destroying: Final cleanup. All data associated with the agent is being archived or deleted.

Not every agent passes through every state. A simple agent might go from draft straight to provisioning, and a healthy agent might never see degraded or recovering. But the states exist because production systems encounter every edge case eventually.

State Transitions

Each transition between states has explicit triggers and guards. You can’t jump from draft to active; you have to pass through the intermediate states.

Moving from configuring to provisioning requires that all mandatory fields are populated. Moving from ready to starting requires that the underlying LLM provider is reachable. Moving from suspended to active requires an admin action with an audit log entry.

The state machine rejects invalid transitions with descriptive errors. If someone tries to start a suspended agent via the API, they get back a clear message explaining that suspended agents require explicit administrative resumption.
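
To make the guard idea concrete, here is a minimal sketch of a transition table, not the actual AgenticMail implementation. The state names follow the list in this article; `TransitionContext`, `transitions`, and `tryTransition` are hypothetical names, and the table covers only the three guarded transitions described above.

```typescript
// State names follow the list in this article.
type LifecycleState =
  | "draft" | "configuring" | "provisioning" | "ready"
  | "starting" | "active" | "paused" | "degraded"
  | "recovering" | "suspended" | "deprovisioning" | "destroying";

// Facts a guard may inspect. These fields are assumptions for illustration.
interface TransitionContext {
  mandatoryFieldsPopulated: boolean;
  llmProviderReachable: boolean;
  adminAction: boolean;
}

// A guard returns null when it passes, or a descriptive error when it fails.
type Guard = (ctx: TransitionContext) => string | null;

// Partial table: only the transitions the article describes explicitly.
const transitions: Partial<Record<LifecycleState, Partial<Record<LifecycleState, Guard>>>> = {
  configuring: {
    provisioning: (ctx) =>
      ctx.mandatoryFieldsPopulated ? null : "all mandatory fields must be populated before provisioning",
  },
  ready: {
    starting: (ctx) =>
      ctx.llmProviderReachable ? null : "the LLM provider must be reachable before starting",
  },
  suspended: {
    active: (ctx) =>
      ctx.adminAction ? null : "suspended agents require explicit administrative resumption",
  },
};

interface TransitionResult {
  ok: boolean;
  state: LifecycleState; // resulting state; unchanged when the transition is rejected
  error: string | null;
}

function tryTransition(from: LifecycleState, to: LifecycleState, ctx: TransitionContext): TransitionResult {
  const guard = transitions[from]?.[to];
  if (guard === undefined) {
    // No entry in the table means the transition is not allowed at all.
    return { ok: false, state: from, error: `invalid transition: ${from} -> ${to}` };
  }
  const failure = guard(ctx);
  if (failure !== null) {
    return { ok: false, state: from, error: failure };
  }
  return { ok: true, state: to, error: null };
}
```

Keeping the allowed transitions in one table means a missing entry is itself the rejection rule, so a draft agent cannot reach active without passing through the intermediate states.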

30-Second Health Checks

Every active agent gets a health check every 30 seconds. The check verifies three things: process liveness (is the runtime still responding?), capability health (are critical tools still connected?), and budget status (has the agent exceeded its token or cost budget?).

If a health check fails, the agent leaves active, and where it goes depends on the failure. A temporary network blip that drops an email connection moves the agent to degraded. The system attempts to reconnect, and if that succeeds, the agent returns to active. A hard crash of the agent runtime sends the agent straight to recovering.
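
The mapping from a check result to the next state can be sketched as a small pure function. This is an illustration under assumptions rather than the module's real logic: the names `HEALTH_CHECK_INTERVAL_MS`, `LivenessAndCapabilities`, and `stateAfterHealthCheck` are hypothetical, and budget status is left out because it is covered in its own section.

```typescript
// Interval taken from the article; the constant name is an assumption.
const HEALTH_CHECK_INTERVAL_MS = 30 * 1000;

// The two check results this sketch needs.
interface LivenessAndCapabilities {
  processAlive: boolean; // is the runtime still responding?
  capsHealthy: boolean;  // are critical tools still connected?
}

type PostCheckState = "active" | "degraded" | "recovering";

function stateAfterHealthCheck(result: LivenessAndCapabilities): PostCheckState {
  // A dead runtime cannot be fixed by reconnecting a tool, so it goes to recovery.
  if (!result.processAlive) return "recovering";
  // The runtime is up but a capability is down: keep running in a reduced mode.
  if (!result.capsHealthy) return "degraded";
  return "active";
}
```

A scheduler would call this once per agent every `HEALTH_CHECK_INTERVAL_MS` milliseconds and request the corresponding transition when the result differs from the current state.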

Auto Recovery After 5 Consecutive Failures

The recovering state implements an automatic recovery protocol. The system attempts to restart the agent with a clean context. If the restart succeeds and subsequent health checks pass, the agent transitions back to active.

But if recovery fails five consecutive times, the system gives up and transitions the agent to suspended. At that point, a human needs to investigate. Each recovery attempt is logged with full context: what failed, what the system tried, and why it didn’t work.
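
The give-up rule can be sketched as a counter that resets on success and suspends the agent at five consecutive failures. Everything named here (`RecoveryTracker`, `recordAttempt`, the log fields) is hypothetical; `succeeded` stands for "the restart worked and the follow-up health checks passed".

```typescript
// Limit taken from the article; the constant name is an assumption.
const MAX_CONSECUTIVE_RECOVERY_FAILURES = 5;

// One log entry per attempt: what failed, what was tried, and how it ended.
interface RecoveryLogEntry {
  attempt: number;
  whatFailed: string;
  whatWasTried: string;
  outcome: string;
}

interface RecoveryTracker {
  state: "recovering" | "active" | "suspended";
  consecutiveFailures: number;
  log: RecoveryLogEntry[];
}

function startRecovery(): RecoveryTracker {
  return { state: "recovering", consecutiveFailures: 0, log: [] };
}

// Returns a new tracker instead of mutating the old one.
function recordAttempt(
  tracker: RecoveryTracker,
  succeeded: boolean,
  whatFailed: string,
  whatWasTried: string,
  outcome: string,
): RecoveryTracker {
  // Attempts only count while the agent is actually recovering.
  if (tracker.state !== "recovering") return tracker;

  const attempt = tracker.log.length + 1;
  const log = [...tracker.log, { attempt, whatFailed, whatWasTried, outcome }];

  if (succeeded) {
    return { state: "active", consecutiveFailures: 0, log };
  }
  const consecutiveFailures = tracker.consecutiveFailures + 1;
  // After the fifth consecutive failure, stop retrying and wait for a human.
  const state = consecutiveFailures >= MAX_CONSECUTIVE_RECOVERY_FAILURES ? "suspended" : "recovering";
  return { state, consecutiveFailures, log };
}
```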

Budget Enforcement

Budget checks are part of the health check cycle, not a separate system. Each agent can have token budgets (max tokens per hour, per day, per month) and cost budgets (max spend in dollars over the same intervals).

When an agent reaches 80% of a budget limit, it transitions to degraded with a budget warning. When the budget is fully exhausted, the agent transitions to paused until the budget resets or an admin increases the allocation. This prevents runaway costs from a misbehaving agent stuck in a retry loop.
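
A rough sketch of the threshold logic follows, assuming each budget is tracked as a used/limit pair per window. `BudgetWindow`, `evaluateBudgets`, and `stateForBudget` are hypothetical names, and taking the worst verdict across all windows is a design assumption rather than documented behavior.

```typescript
// Warning threshold taken from the article; the constant name is an assumption.
const BUDGET_WARNING_RATIO = 0.8;

// One budget window, e.g. tokens per hour or dollars per day.
interface BudgetWindow {
  label: string;
  used: number;
  limit: number;
}

type BudgetVerdict = "ok" | "warning" | "exhausted";

function evaluateWindow(w: BudgetWindow): BudgetVerdict {
  if (w.used >= w.limit) return "exhausted";
  if (w.used >= w.limit * BUDGET_WARNING_RATIO) return "warning";
  return "ok";
}

// The agent's overall verdict is the worst verdict among its windows.
function evaluateBudgets(windows: BudgetWindow[]): BudgetVerdict {
  const verdicts = windows.map(evaluateWindow);
  if (verdicts.indexOf("exhausted") !== -1) return "exhausted";
  if (verdicts.indexOf("warning") !== -1) return "warning";
  return "ok";
}

// Maps the verdict to the lifecycle state described above.
function stateForBudget(verdict: BudgetVerdict): "active" | "degraded" | "paused" {
  switch (verdict) {
    case "exhausted": return "paused";
    case "warning": return "degraded";
    default: return "active";
  }
}
```

For example, an agent at 500 of 1,000 hourly tokens but $8.50 of a $10 daily spend limit would get a warning verdict and move to degraded, because the daily cost window has crossed 80%.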

Why a Formal State Machine

I could have modeled agent status as a simple enum with ad hoc transition logic scattered across the codebase. That works until you have 50 agents and need to answer “why did this agent stop working at 3am?”

The formal state machine gives you a single source of truth for agent status, validated transitions that prevent impossible states, and a complete audit trail. That’s worth the upfront complexity of modeling it properly.

Source Code

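The lifecycle module defines 12 states and runs health checks every 30 seconds, transitioning agents automatically when failures are detected. The identifiers in this excerpt differ in places from the state names used in this article (for example, the type includes 'deploying', 'stopped', 'error', and 'updating'):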

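// The lifecycle states as named in the source.
export type AgentState =
  | 'draft' | 'configuring' | 'ready' | 'provisioning'
  | 'deploying' | 'starting' | 'running' | 'degraded'
  | 'stopped' | 'error' | 'updating' | 'destroying';

// Runs the three checks for one agent and reports each result with the overall verdict.
async function runHealthCheck(agent: ManagedAgent): Promise<HealthResult> {
  const processAlive = await checkProcessLiveness(agent.pid); // is the runtime still responding?
  const capsHealthy = await checkCapabilities(agent.id);      // are critical tools still connected?
  const withinBudget = await checkBudgetStatus(agent.id);     // token and cost budgets not exceeded?
  return {
    // Healthy only when all three checks pass.
    healthy: processAlive && capsHealthy && withinBudget,
    processAlive,
    capsHealthy,
    withinBudget,
    checkedAt: Date.now(), // timestamp in milliseconds since the Unix epoch
  };
}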

View the full source on GitHub