Using GX10 local models from your machine (codex / Qwen Code)

Run your local codex or Qwen Code CLI against the on-prem GX10 models (e.g. gpt-oss-120b) through the ai-agents inference proxy — no OpenAI/Anthropic cloud, no GPU on your machine. Inference runs on the GX10; the proxy handles auth, routing, and tool-call formatting.

your laptop (codex / qwen)
   │  OpenAI-compatible HTTPS + API key
   ▼
ai-agents proxy  (chat.ai-agents-dev.hcl.labrats.work)
   │  resolves the model → backend, normalizes tool calls
   ▼
GX10 vLLM  (gpt-oss-120b, …)

Prerequisites

codex CLI ≥ 0.13 (uses the Responses API) and/or Qwen Code CLI (qwen).
Network access to https://chat.ai-agents-dev.hcl.labrats.work (reachable off-cluster; the proxy enforces API-key auth).
A proxy API key (below).

1. Get an API key

The proxy rejects unauthenticated requests (401). Get a key one of:

From the cluster (if you have kubectl):

kubectl -n ai-agents-dev get secret qwen-code-proxy-credentials \
  -o jsonpath='{.data.key}' | base64 -d

From an admin, or issue a personal key via the dashboard (POST /api/api-keys).

Then export it (both CLIs read OPENAI_API_KEY):

export OPENAI_API_KEY="<proxy-key>"

2. See available models

curl -s -H "Authorization: Bearer $OPENAI_API_KEY" \
  https://chat.ai-agents-dev.hcl.labrats.work/v1/models | jq '.data[].id'

⚠️ The GX10 serves one model at a time (single-active). gpt-oss-120b is the currently active model — use it. qwen3-coder-30b is registered but resolves to an inactive backend right now; requesting it will fail until the GX10 active model is swapped. Ask in #infra before relying on a non-active model.

3. codex CLI (Responses API)

codex ≥ 0.13 only speaks the Responses API (wire_api = "responses"), which the proxy now supports.

Add to ~/.codex/config.toml:

[model_providers.gx10]
name     = "GX10 ai-agents proxy"      # required — codex errors if empty
base_url = "https://chat.ai-agents-dev.hcl.labrats.work/v1"
wire_api = "responses"
env_key  = "OPENAI_API_KEY"

model          = "gpt-oss-120b"
model_provider = "gx10"

Run it:

export OPENAI_API_KEY="<proxy-key>"
codex                       # interactive (prompts before running tools)
codex exec "summarize README.md"   # non-interactive

Per-invocation (no config edit):

codex -m gpt-oss-120b \
  -c model_provider=gx10 \
  -c 'model_providers.gx10.name="GX10"' \
  -c 'model_providers.gx10.base_url="https://chat.ai-agents-dev.hcl.labrats.work/v1"' \
  -c 'model_providers.gx10.wire_api="responses"' \
  -c 'model_providers.gx10.env_key="OPENAI_API_KEY"'

4. Qwen Code CLI (chat-completions)

qwen uses chat-completions. Add an openai provider in ~/.qwen/settings.json:

{
  "security": { "auth": { "selectedType": "openai" } },
  "model": { "name": "gpt-oss-120b" },
  "modelProviders": {
    "openai": [{
      "id": "gpt-oss-120b",
      "name": "GX10 ai-agents proxy",
      "envKey": "OPENAI_API_KEY",
      "baseUrl": "https://chat.ai-agents-dev.hcl.labrats.work/v1"
    }]
  }
}

export OPENAI_API_KEY="<proxy-key>"
qwen -p "list the files here and summarize"

5. Verify (one-shot tool call)

codex exec --dangerously-bypass-approvals-and-sandbox \
  "run 'echo it-works' with your shell tool and report the output"

You should see the command execute and it-works reported back — that confirms the model is reaching the GX10 and tool calling works.

Troubleshooting

Symptom	Cause / fix
`404 Cannot POST /v1/responses`	Proxy too old, or `wire_api` not `responses`. Needs proxy ≥ `0.122.0` (dev) and `wire_api = "responses"`.
`401`	Missing/invalid `OPENAI_API_KEY`.
`provider name must not be empty` (codex)	Set `name` on the `[model_providers.*]` block.
`wire_api = "chat" is no longer supported` (codex)	codex ≥ 0.13 is responses-only — use `wire_api = "responses"`.
`Model not found` / backend errors	You requested a non-active GX10 model (see §2) — use `gpt-oss-120b`.
Falls back to OpenAI/Aliyun cloud	`base_url`/provider not selected — ensure `model_provider`/provider points at `gx10` and the base_url is set.

Notes

This targets the dev proxy (ai-agents-dev). A prod endpoint requires a separate promotion.
Tool calling is normalized by the proxy (the GX10 emits Hermes-style calls, surfaced as standard OpenAI/Responses tool_calls), so standard OpenAI-compatible clients work.
gpt-oss reasoning currently appears inline in message content rather than as a separate reasoning item — harmless for tool loops, occasionally chatty.