Rollback runbook (spec 0025)
A bad prod deploy must be reversible in one step. There is no auto-rollback CRD in Flux; the standard, fast path is to revert the promotion PR.
Standard rollback — revert the promotion PR
Prod is pinned by a strict ImagePolicy range in ai-agents-platform
(ai-agents-image-automation.yaml). A promotion is a one-line PR bumping that
range; rolling back is reverting it.
# Find the promotion commit that shipped the bad version
gh pr list -R labrats-work/ai-agents-platform --state merged --search "bump ai-agents prod"
# Revert it
gh pr revert <PR#> -R labrats-work/ai-agents-platform # or: git revert <sha> + PR
Flux reconciles the reverted range and prod rolls back to the previous pinned version (its image is still in GHCR). No image rebuild, no manual edit.
Verify:
kubectl -n ai-agents-prod get deploy ai-agents-main -o jsonpath='{.spec.template.spec.containers[0].image}'
flux -n ai-agents-platform get image policy ai-agents-prod
Why not auto-rollback by default
Prod is single-replica for the operator and routes inference to the single-active GX10 model, so a metric-gated automatic rollback (Flagger/Argo) needs care:
- Stateless API (
ai-agents-main) — a Flagger canary with Prometheus success-rate/latency analysis is viable and is the recommended next step (manifest sketch below). Gate onai_ir_failures_totalrate and request latency from spec 0025's metrics. - GX10 model slot — treat a model swap as blue/green, not canary: serve the new model, smoke-test it (the deep-liveness canary, #716/#717), flip availability, keep the old slot ready to flip back. This is operational, not a Flagger concern.
Flagger canary (API) — sketch, not yet wired
# Apply once Flagger is installed; tune thresholds against real traffic first.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: ai-agents-main
namespace: ai-agents-prod
spec:
targetRef: { apiVersion: apps/v1, kind: Deployment, name: ai-agents-main }
analysis:
interval: 1m
threshold: 5 # rollback after 5 failed checks
maxWeight: 50
stepWeight: 10
metrics:
- name: request-success-rate
thresholdRange: { min: 99 }
interval: 1m
- name: request-duration
thresholdRange: { max: 1500 } # ms
interval: 1m
Until Flagger is installed and these thresholds are validated, revert-the-PR is the rollback path.