AI Agents

Rollback runbook (spec 0025)

A bad prod deploy must be reversible in one step. There is no auto-rollback CRD in Flux; the standard, fast path is to revert the promotion PR.

Standard rollback — revert the promotion PR

Prod is pinned by a strict ImagePolicy range in ai-agents-platform (ai-agents-image-automation.yaml). A promotion is a one-line PR bumping that range; rolling back is reverting it.

# Find the promotion commit that shipped the bad version
gh pr list -R labrats-work/ai-agents-platform --state merged --search "bump ai-agents prod"
# Revert it
gh pr revert <PR#> -R labrats-work/ai-agents-platform   # or: git revert <sha> + PR

Flux reconciles the reverted range and prod rolls back to the previous pinned version (its image is still in GHCR). No image rebuild, no manual edit.

Verify:

kubectl -n ai-agents-prod get deploy ai-agents-main -o jsonpath='{.spec.template.spec.containers[0].image}'
flux -n ai-agents-platform get image policy ai-agents-prod

Why not auto-rollback by default

Prod is single-replica for the operator and routes inference to the single-active GX10 model, so a metric-gated automatic rollback (Flagger/Argo) needs care:

  • Stateless API (ai-agents-main) — a Flagger canary with Prometheus success-rate/latency analysis is viable and is the recommended next step (manifest sketch below). Gate on ai_ir_failures_total rate and request latency from spec 0025's metrics.
  • GX10 model slot — treat a model swap as blue/green, not canary: serve the new model, smoke-test it (the deep-liveness canary, #716/#717), flip availability, keep the old slot ready to flip back. This is operational, not a Flagger concern.

Flagger canary (API) — sketch, not yet wired

# Apply once Flagger is installed; tune thresholds against real traffic first.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: ai-agents-main
  namespace: ai-agents-prod
spec:
  targetRef: { apiVersion: apps/v1, kind: Deployment, name: ai-agents-main }
  analysis:
    interval: 1m
    threshold: 5            # rollback after 5 failed checks
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange: { min: 99 }
        interval: 1m
      - name: request-duration
        thresholdRange: { max: 1500 }  # ms
        interval: 1m

Until Flagger is installed and these thresholds are validated, revert-the-PR is the rollback path.