Self-healing remediation loop (spec 0025)

The capstone that composes the spec-0025 phases into one loop:

detect → open issue → auto-PR → test on dev → promote

How the pieces connect

Step	Mechanism	Phase
detect	deep-liveness canary (#716/#717) + failure metrics (`ai_ir_failures_total`, #731)	1, 6
react	self-heal controller restarts / queues / recovers (#718, #719, #720, #722)	2, 3
escalate → issue	controller opens/labels a `self-heal` issue when restarts exceed the daily cap, or in observe-only mode (#718)	2
auto-PR	`self-healing-remediation.yml` turns the issue into a fix PR on `fix/self-heal-<issue>`	5 (this)
test on dev	the fix PR runs unit + integration (#723) + evals (#726)	4, 5
promote	eval-gated promotion via `dev-evals-green` (#727) — human-approved	5

The loop can never mask root cause or thrash:

Trigger is narrow — only the operator's self-heal escalation label fires it.
Never touches main — fixes land on fix/self-heal-<issue>; the executor checkout -B main guard prevents accidental main pushes.
One open PR per fingerprint — deduped on the branch name; a recurring fault doesn't spawn a PR storm.
Same gate as any change — the fix PR runs the full unit/integration/eval suite; nothing bypasses CI.
Promotion stays human-gated — the loop opens a PR; it does not merge or promote. dev-evals-green + a maintainer approve the prod bump.
Audit trail — every run comments on the issue.
Restart cap — the controller escalates to a human rather than looping once the per-model daily restart cap is hit (#718).

The self-heal label exists and the operator is configured to apply it on escalation (set SELF_HEAL_* env; restarts stay observe-only until trusted).
AI_AGENTS_API_KEY secret for the remediation workflow (a scoped key).
Start conservative: keep promotion human-approved (the default here). Tighten only after the loop has demonstrably opened good PRs over several incidents.