Self-healing remediation loop (spec 0025)
The capstone that composes the spec-0025 phases into one loop:
detect → open issue → auto-PR → test on dev → promote
How the pieces connect
| Step | Mechanism | Phase |
|---|---|---|
| detect | deep-liveness canary (#716/#717) + failure metrics (ai_ir_failures_total, #731) | 1, 6 |
| react | self-heal controller restarts / queues / recovers (#718, #719, #720, #722) | 2, 3 |
| escalate → issue | controller opens/labels a self-heal issue when restarts exceed the daily cap, or in observe-only mode (#718) | 2 |
| auto-PR | self-healing-remediation.yml turns the issue into a fix PR on fix/self-heal-<issue> | 5 (this) |
| test on dev | the fix PR runs unit + integration (#723) + evals (#726) | 4, 5 |
| promote | eval-gated promotion via dev-evals-green (#727) — human-approved | 5 |
Loop-safety (designed in, not bolted on)
The loop can never mask root cause or thrash:
- Trigger is narrow — only the operator's
self-healescalation label fires it. - Never touches main — fixes land on
fix/self-heal-<issue>; the executorcheckout -B mainguard prevents accidental main pushes. - One open PR per fingerprint — deduped on the branch name; a recurring fault doesn't spawn a PR storm.
- Same gate as any change — the fix PR runs the full unit/integration/eval suite; nothing bypasses CI.
- Promotion stays human-gated — the loop opens a PR; it does not merge
or promote.
dev-evals-green+ a maintainer approve the prod bump. - Audit trail — every run comments on the issue.
- Restart cap — the controller escalates to a human rather than looping once the per-model daily restart cap is hit (#718).
Enabling it
- The
self-heallabel exists and the operator is configured to apply it on escalation (setSELF_HEAL_*env; restarts stay observe-only until trusted). AI_AGENTS_API_KEYsecret for the remediation workflow (a scoped key).- Start conservative: keep promotion human-approved (the default here). Tighten only after the loop has demonstrably opened good PRs over several incidents.