Eval-gated dev → prod promotion (spec 0025)
Prod promotion is no longer a bare, unvalidated ImagePolicy range bump. A
candidate must pass per-role golden evals (and the integration suite) against
the live dev environment before it can be promoted.
Flow
release → dev auto-rolls (loose ImagePolicy >=0.0.0)
→ pre-promote-evals.yml runs the golden evals against dev,
judged by the local Qwen model (no external API), and the
Playwright e2e (e2e.yml) runs the dashboard/contract checks
→ both green → a `dev-evals-green` commit status is recorded
→ promotion PR on ai-agents-platform bumps the strict ImagePolicy range
(flux-image-promotion skill); its merge is blocked by the required
`dev-evals-green` status check
→ merge → Flux rolls prod
dev stays loose (auto-rolls every release); only the strict (prod) marker is
gated, so dev keeps exercising candidates while prod waits for a green eval.
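The "both green → status recorded" step in the flow can be sketched as a small pure function. This is a minimal sketch, not the workflow's actual implementation: the function name, the description strings, and the optional `run_url` parameter are assumptions. The field names (`state`, `context`, `description`, `target_url`) are those accepted by GitHub's create-commit-status endpoint (`POST /repos/{owner}/{repo}/statuses/{sha}`); which commit the status is attached to is not specified here.

```python
import json

# Status name taken from the flow above.
STATUS_CONTEXT = "dev-evals-green"


def build_status_payload(evals_passed: bool, e2e_passed: bool, run_url: str = "") -> dict:
    """Build the JSON body for a GitHub commit status.

    The status is green only when BOTH the golden evals and the
    Playwright e2e run passed against dev.
    """
    green = evals_passed and e2e_passed
    payload = {
        "state": "success" if green else "failure",
        "context": STATUS_CONTEXT,
        # Description wording is an assumption, not taken from the workflows.
        "description": (
            "golden evals and e2e passed against dev"
            if green
            else "golden evals or e2e failed against dev"
        ),
    }
    if run_url:
        # Optional link back to the workflow run that produced the result.
        payload["target_url"] = run_url
    return payload


if __name__ == "__main__":
    print(json.dumps(build_status_payload(True, True), indent=2))
```

Keeping the decision in one function makes the gate easy to test: a failure in either suite yields a `failure` state, so the required check can never go green on evals alone.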
One-time human setup (the part automation can't do)
These are GitHub configuration decisions that must be made by a maintainer:
- Branch ruleset on `ai-agents-platform` requiring the `dev-evals-green` status check before a PR touching `ai-agents-image-automation.yaml` can merge.
- Route the prod marker through a PR, not a direct commit to `main` (the loose dev marker may still auto-commit). See the IUA `update.path` split.
- Secrets: `AI_AGENTS_DEV_API_KEY` for the eval/e2e workflows (a dev bearer key with no admin scope).
- (optional) a GitHub Environment (`prod`) with a deployment protection rule that auto-approves once `dev-evals-green` is set, for one-click promote.
Until the branch ruleset (step 1) is configured, the evals still run and record the status, but they are advisory only; the ruleset is what makes them blocking.
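For the maintainer doing the ruleset setup, the request body might look roughly like the sketch below. It assumes the ruleset is created through GitHub's repository rulesets REST endpoint (`POST /repos/{owner}/{repo}/rulesets`) and targets the default branch; the ruleset name and the function name are hypothetical. It does not express the "PR touching `ai-agents-image-automation.yaml`" condition, which has to be handled separately.

```python
import json


def build_ruleset_payload(required_context: str = "dev-evals-green") -> dict:
    """Build a request body for a branch ruleset that requires one status check."""
    return {
        "name": "require-dev-evals-green",  # hypothetical ruleset name
        "target": "branch",
        "enforcement": "active",  # "disabled" would leave the evals advisory
        "conditions": {
            # Assumption: the gate applies to the repository's default branch.
            "ref_name": {"include": ["~DEFAULT_BRANCH"], "exclude": []}
        },
        "rules": [
            {
                "type": "required_status_checks",
                "parameters": {
                    "required_status_checks": [{"context": required_context}],
                    # Do not additionally require the branch to be up to date.
                    "strict_required_status_checks_policy": False,
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_ruleset_payload(), indent=2))
```

The same settings can be entered through the repository settings UI; the payload form is mainly useful for reviewing exactly what the ruleset enforces.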
Adding a golden case
Every production incident should become a permanent regression case. Add a row
to the matching role in evals/promptfooconfig.yaml with an llm-rubric
assertion describing the correct behavior. Keep the grader on the local model so
evals stay zero-cost.
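As an illustration of what such a row contains, the sketch below builds one test case in the shape promptfoo uses for tests (`description`, `vars`, and an `assert` list with a `type` and `value`). The helper name, the `prompt` variable, and the example incident are hypothetical, and where the row sits under each role in `evals/promptfooconfig.yaml` depends on that file's actual layout.

```python
import json


def golden_case(description: str, variables: dict, rubric: str) -> dict:
    """Build one promptfoo-style test case graded by an llm-rubric assertion."""
    if not rubric.strip():
        raise ValueError("a golden case needs a rubric describing the correct behavior")
    return {
        "description": description,
        "vars": dict(variables),
        "assert": [{"type": "llm-rubric", "value": rubric}],
    }


if __name__ == "__main__":
    # Hypothetical incident, for illustration only.
    case = golden_case(
        "incident: agent answered without citing the retrieved document",
        {"prompt": "Summarize the attached runbook and cite its sections."},
        "The answer cites at least one section of the provided runbook.",
    )
    # JSON is valid YAML, so the output can be converted into a row in the config.
    print(json.dumps(case, indent=2))
```

Writing the rubric as a statement of the correct behavior, rather than a description of the failure, keeps the case meaningful after the original bug is fixed.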