AI Agents

Eval-gated dev → prod promotion (spec 0025)

Prod promotion is no longer a bare, unvalidated ImagePolicy range bump. A candidate must pass per-role golden evals (and the integration suite) against the live dev environment before it can be promoted.

Flow

release → dev auto-rolls (loose ImagePolicy >=0.0.0)
   → pre-promote-evals.yml runs the golden evals against dev,
     judged by the local Qwen model (no external API), and the
     Playwright e2e (e2e.yml) runs the dashboard/contract checks
   → both green → a `dev-evals-green` commit status is recorded
   → promotion PR on ai-agents-platform bumps the strict ImagePolicy range
     (flux-image-promotion skill); its merge is blocked by the required
     `dev-evals-green` status check
   → merge → Flux rolls prod
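
The eval and status-recording steps of the flow above might look roughly like the workflow below. This is a minimal sketch, not the repository's actual pre-promote-evals.yml: the trigger, the `target_sha` input, the `PLATFORM_REPO` value, the `PLATFORM_STATUS_TOKEN` secret, and the runner label are all assumptions made for illustration, and the Playwright e2e half of the "both green" condition is left out.

```yaml
# Hypothetical sketch of pre-promote-evals.yml (names marked "assumed" are not from the repo).
name: pre-promote-evals

on:
  workflow_dispatch:
    inputs:
      target_sha:                 # assumed: commit on ai-agents-platform that receives the status
        description: Commit to mark with dev-evals-green
        required: true
        type: string

permissions:
  contents: read

jobs:
  golden-evals:
    runs-on: self-hosted          # assumed: a runner that can reach dev and the local Qwen judge
    steps:
      - uses: actions/checkout@v4

      - name: Run golden evals against dev
        env:
          AI_AGENTS_DEV_API_KEY: ${{ secrets.AI_AGENTS_DEV_API_KEY }}
        # A failing assertion makes promptfoo exit non-zero, so the next step is skipped.
        run: npx promptfoo@latest eval -c evals/promptfooconfig.yaml

      - name: Record dev-evals-green
        env:
          GH_TOKEN: ${{ secrets.PLATFORM_STATUS_TOKEN }}   # assumed: token allowed to write commit statuses
          PLATFORM_REPO: OWNER/ai-agents-platform          # replace OWNER; the owner is not given here
          TARGET_SHA: ${{ inputs.target_sha }}
        # Requires the GitHub CLI (gh) on the runner.
        run: |
          gh api --method POST "repos/${PLATFORM_REPO}/statuses/${TARGET_SHA}" \
            -f state=success \
            -f context=dev-evals-green \
            -f description="Golden evals passed against dev"
```

Passing the SHA through an environment variable rather than interpolating it into the script keeps the input out of the shell command line.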

Dev stays loose (it auto-rolls on every release); only the strict (prod) marker is gated, so dev keeps exercising candidates while prod waits for a green eval.
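
As an illustration of the loose/strict split, the two policies could look like the sketch below. Only the `>=0.0.0` dev range comes from this document; the object names, namespace, image repository reference, and the prod range value are hypothetical, and the `apiVersion` should match the installed Flux release.

```yaml
# Loose policy: dev follows every release.
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: ai-agents-dev             # hypothetical name
  namespace: flux-system          # hypothetical namespace
spec:
  imageRepositoryRef:
    name: ai-agents               # hypothetical ImageRepository
  policy:
    semver:
      range: ">=0.0.0"
---
# Strict policy: prod only sees versions inside a range that a promotion PR widens.
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: ai-agents-prod            # hypothetical name
  namespace: flux-system
spec:
  imageRepositoryRef:
    name: ai-agents
  policy:
    semver:
      range: ">=0.0.0 <=1.4.2"    # hypothetical ceiling; the promotion PR would raise it
```

With this shape, a new release is picked up by dev immediately, while prod stays on its current version until the ceiling is raised in a merged PR.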

One-time human setup (the part automation can't do)

These are GitHub configuration decisions that must be made by a maintainer:

  1. Add a branch ruleset on ai-agents-platform that requires the `dev-evals-green` status check before a PR touching ai-agents-image-automation.yaml can merge.
  2. Route the prod marker through a PR instead of a direct commit to main (the loose dev marker may still auto-commit). See the ImageUpdateAutomation (IUA) update.path split.
  3. Add the AI_AGENTS_DEV_API_KEY secret for the eval/e2e workflows (a dev bearer key with no admin scope).
  4. (Optional) Create a GitHub Environment (prod) with a deployment protection rule that auto-approves once `dev-evals-green` is set, for one-click promotion.

Until step 1 is configured, the evals still run and record the status, but they are only advisory; the ruleset is what makes them blocking.
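
For step 2, one way to split the automation is sketched below: the dev object commits straight to main, while the prod object pushes to a separate branch from which a PR is opened. This assumes the dev and prod manifests live under separate directories; the object names, paths, branch names, and commit author are hypothetical, and the repository's actual layout may differ.

```yaml
# Dev: marker updates under the dev path are committed directly to main.
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageUpdateAutomation
metadata:
  name: ai-agents-dev             # hypothetical name
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: ai-agents-platform      # hypothetical GitRepository object
  git:
    checkout:
      ref:
        branch: main
    commit:
      author:
        name: fluxbot             # hypothetical author
        email: fluxbot@example.com
    push:
      branch: main
  update:
    path: ./clusters/dev          # hypothetical path
    strategy: Setters
---
# Prod: updates land on a side branch, so they reach main only through a PR.
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageUpdateAutomation
metadata:
  name: ai-agents-prod            # hypothetical name
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: ai-agents-platform
  git:
    checkout:
      ref:
        branch: main
    commit:
      author:
        name: fluxbot
        email: fluxbot@example.com
    push:
      branch: promote/prod        # hypothetical branch; a PR from it is subject to the ruleset
  update:
    path: ./clusters/prod         # hypothetical path
    strategy: Setters
```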

Adding a golden case

Every production incident should become a permanent regression case. Add a row to the matching role in evals/promptfooconfig.yaml with an llm-rubric assertion describing the correct behavior. Keep the grader on the local model so evals stay zero-cost.
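
A new golden case could look like the fragment below. It is a sketch only: how roles are laid out in the file is not shown here, the variable name must match whatever the role's prompt template expects, the incident text is invented for illustration, and the grader provider id assumes the local Qwen model is served through Ollama.

```yaml
# Fragment of evals/promptfooconfig.yaml; existing prompts, providers, and cases are omitted.
defaultTest:
  options:
    # Grader for llm-rubric assertions. Assumed provider id; use the one the repo already sets.
    provider: ollama:chat:qwen2.5

tests:
  - description: "Regression: agent promised a refund it cannot issue (hypothetical incident)"
    vars:
      user_message: "My order arrived broken. Refund me right now."   # hypothetical variable and input
    assert:
      - type: llm-rubric
        value: >-
          The reply apologizes, explains that refunds are handled by a human
          reviewer, and does not state that a refund has already been issued.
```

Writing the rubric as a description of the correct behavior, rather than an exact expected string, lets the case keep passing when the wording of a good answer changes.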