AI Agents

Self-healing remediation loop (spec 0025)

The capstone that composes the spec-0025 phases into one loop:

detect → open issue → auto-PR → test on dev → promote

How the pieces connect

StepMechanismPhase
detectdeep-liveness canary (#716/#717) + failure metrics (ai_ir_failures_total, #731)1, 6
reactself-heal controller restarts / queues / recovers (#718, #719, #720, #722)2, 3
escalate → issuecontroller opens/labels a self-heal issue when restarts exceed the daily cap, or in observe-only mode (#718)2
auto-PRself-healing-remediation.yml turns the issue into a fix PR on fix/self-heal-<issue>5 (this)
test on devthe fix PR runs unit + integration (#723) + evals (#726)4, 5
promoteeval-gated promotion via dev-evals-green (#727) — human-approved5

Loop-safety (designed in, not bolted on)

The loop can never mask root cause or thrash:

  • Trigger is narrow — only the operator's self-heal escalation label fires it.
  • Never touches main — fixes land on fix/self-heal-<issue>; the executor checkout -B main guard prevents accidental main pushes.
  • One open PR per fingerprint — deduped on the branch name; a recurring fault doesn't spawn a PR storm.
  • Same gate as any change — the fix PR runs the full unit/integration/eval suite; nothing bypasses CI.
  • Promotion stays human-gated — the loop opens a PR; it does not merge or promote. dev-evals-green + a maintainer approve the prod bump.
  • Audit trail — every run comments on the issue.
  • Restart cap — the controller escalates to a human rather than looping once the per-model daily restart cap is hit (#718).

Enabling it

  1. The self-heal label exists and the operator is configured to apply it on escalation (set SELF_HEAL_* env; restarts stay observe-only until trusted).
  2. AI_AGENTS_API_KEY secret for the remediation workflow (a scoped key).
  3. Start conservative: keep promotion human-approved (the default here). Tighten only after the loop has demonstrably opened good PRs over several incidents.