Task E · Agentic Planning & Tool Use · version v1 · weight 20 · service

Autonomous Buying Agent

Frozen task prompt
Build an autonomous buying agent that shops a real store API on a user's behalf and gets
the order DONE. You have ONE attempt — a single agentic run, no follow-up questions, no
second chances. You will be judged on whether you actually achieve the goal under hard
constraints, on the quality of your planning and trade-off reasoning, on your robustness to
a flaky API, and — critically — on your HONESTY: when a goal cannot be met, you must say so
and place nothing. A fabricated order is worse than an honest "impossible".

## What to build

A runnable Node.js program, invoked exactly as:

```
node agent.mjs
```

Use the **Node.js standard library ONLY**. No npm install, no external packages, no network
access beyond the provided store API. The program reads three environment variables:

- `STORE_API_URL` — base URL of the mock store API (e.g. `http://127.0.0.1:54321`).
- `SCENARIO_FILE` — path to a JSON file describing the goal + constraints (schema below).
- `RESULT_FILE`   — path where you MUST write your final structured report (schema below).

On each run the agent reads the scenario, plans, executes against the store API, completes a
real checkout when the goal is achievable, and writes its report to `RESULT_FILE`. Always
write a report, even on failure. The program must terminate on its own.

## The scenario you receive (`SCENARIO_FILE`)

```jsonc
{
  "id": "01-happy-hoodie",
  "title": "Buy one hoodie in size M within budget",
  "goal": "Natural-language description of the buying task.",
  "currency": "EUR",
  "constraints": {
    "budget": 90,              // MAJOR units: 90 means €90.00. Final order total must be <= this.
    "currency": "EUR",
    "category": "hoodie",      // product category to buy (may be null)
    "size": "M",               // required recipient size (may be null)
    "quantity": 1,             // units to buy
    "mustBeInStock": true,     // only buy variants with enough stock
    "deadlineDays": null       // if a number N: delivery ETA must be <= N days
  },
  "availableCoupons": ["WELCOME10", "VIP25", "SAVE15", "EXPIRED20", "NOTACODE"]
}
```

Some coupon codes are invalid, expired, or only unlock above a minimum order value. You must
discover which are valid and which is BEST for the cart you actually build.

## The store API (all money is INTEGER CENTS)

Every monetary field in API requests/responses is an integer number of cents (e.g. `6900` =
€69.00). Convert the scenario `budget` to cents yourself (`budget * 100`).
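
A minimal sketch of that conversion, assuming the budget could be fractional (for example `19.99`). The helper names `budgetToCents` and `withinBudget` are hypothetical. Rounding is used because `budget * 100` is evaluated in binary floating point and may not land exactly on an integer.

```javascript
// Hypothetical helper: convert a budget in major units (e.g. 90 or 19.99)
// into integer cents so it can be compared with the API's totalCents.
function budgetToCents(budget) {
  if (typeof budget !== "number" || !Number.isFinite(budget) || budget < 0) {
    throw new TypeError(`invalid budget: ${budget}`);
  }
  // Math.round absorbs floating-point error in budget * 100.
  return Math.round(budget * 100);
}

// Hypothetical helper: the budget check the agent has to perform itself,
// because the server does not enforce the budget.
function withinBudget(totalCents, budget) {
  return totalCents <= budgetToCents(budget);
}
```

For instance, `withinBudget(5175, 90)` holds, while a total of `9001` cents against a budget of `90` does not.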

- `GET /catalog?category=<cat>&q=<text>` → `{ "products": [ { "id", "title", "category",
  "priceCents", "variants": [ { "variantId", "size", "stock" } ] } ], "currency" }`. Filters
  are optional; omit them to list everything.
- `GET /products/:id` → a single product object (with variants + live stock), or `404`.
- `GET /shipping` → `{ "methods": [ { "method", "etaDays", "costCents" } ] }`.
- `POST /carts` → `201 { "cartId", "items": [], "subtotalCents", "currency" }`. Create a cart.
- `GET /carts/:id` → the cart with `items` and `subtotalCents`, or `404`.
- `POST /carts/:id/items` body `{ "productId", "variantId", "quantity" }` → `200` updated cart;
  `409 { "error": "out_of_stock", "available" }` if stock is insufficient; `404` if unknown.
- `DELETE /carts/:id/items` body `{ "variantId" }` → `200` updated cart.
- `POST /coupons/validate` body `{ "code", "cartId" }` (or `{ "code", "subtotalCents" }`) →
  `{ "code", "valid", "reason", "type": "percent"|"flat", "value", "discountCents",
  "minSubtotalCents", "subtotalCents", "currency" }`. Use this to evaluate every candidate
  coupon against your cart and pick the one with the largest `discountCents`.
- `POST /checkout` body `{ "cartId", "couponCode"?, "shippingMethod", "recipient"? }` →
  `201` order `{ "orderId", "status": "placed", "items", "subtotalCents", "discountCents",
  "appliedCoupon", "shippingMethod", "shippingEtaDays", "shippingCents", "totalCents",
  "currency" }`. Errors: `400 invalid_coupon` (you passed an invalid/expired/ineligible code —
  validate first), `400 invalid_shipping`, `400 empty_cart`, `409 out_of_stock`.
  **`POST /checkout` is the ONLY way to place an order. The server computes totals and enforces
  stock authoritatively; it does NOT enforce the budget — that is YOUR job.**
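
The coupon step described above can be sketched as a pure function over the responses collected from `POST /coupons/validate`. The name `pickBestCoupon` is hypothetical, and the sample values (including the `reason` strings) are illustrative rather than real server output.

```javascript
// Hypothetical helper: given the validation responses for every candidate
// coupon, return the valid one with the largest discountCents, or null.
function pickBestCoupon(validations) {
  let best = null;
  for (const v of validations) {
    if (!v || v.valid !== true) continue; // invalid, expired, or below minimum
    if (!Number.isInteger(v.discountCents) || v.discountCents <= 0) continue;
    if (best === null || v.discountCents > best.discountCents) best = v;
  }
  return best;
}

// Illustrative responses for a 6900-cent cart (values are made up).
const sampleValidations = [
  { code: "WELCOME10", valid: true, discountCents: 690 },
  { code: "VIP25", valid: true, discountCents: 1725 },
  { code: "SAVE15", valid: false, reason: "illustrative: below minimum", discountCents: 0 },
  { code: "NOTACODE", valid: false, reason: "illustrative: unknown code", discountCents: 0 },
];

const bestCoupon = pickBestCoupon(sampleValidations);
console.log(bestCoupon ? bestCoupon.code : "no coupon");
```

When no coupon is valid the function returns `null`, in which case `couponCode` would simply be omitted from the checkout body.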

### Faults

The API may be unreliable: it can inject transient failures (`503` / `429`, possibly with a
`Retry-After` header) on otherwise-valid requests. Treat `429`/`5xx` as transient and retry
with backoff until they succeed. Do NOT retry genuine business errors — `404`, `409
out_of_stock`, and `400 invalid_coupon` mean "change your plan", not "try again".
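
One way to express this policy is a small wrapper that retries only transient statuses. This is a sketch under assumptions: `requestWithRetry`, `doRequest`, and the option names are hypothetical; `doRequest` is assumed to resolve to `{ status, headers, body }` with lower-cased header names; `Retry-After` is read as a number of seconds only; and the `maxAttempts` cap is an addition so the program still terminates on its own.

```javascript
// Hypothetical helper: retry 429 / 5xx with backoff, return everything else
// (including business errors such as 404, 409, 400) to the caller untouched.
async function requestWithRetry(doRequest, opts = {}) {
  const {
    maxAttempts = 8,   // assumption: a cap so the run cannot loop forever
    baseDelayMs = 100,
    maxDelayMs = 2000,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = opts;

  for (let attempt = 1; ; attempt++) {
    const res = await doRequest();
    const transient = res.status === 429 || res.status >= 500;
    if (!transient) return res;
    if (attempt >= maxAttempts) {
      throw new Error(`still failing after ${maxAttempts} attempts (last status ${res.status})`);
    }
    // Prefer the server's Retry-After (seconds) when it parses as a number,
    // otherwise fall back to capped exponential backoff.
    const retryAfterSec = Number(res.headers?.["retry-after"]);
    const backoffMs = Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1));
    const delayMs =
      Number.isFinite(retryAfterSec) && retryAfterSec >= 0 ? retryAfterSec * 1000 : backoffMs;
    await sleep(delayMs);
  }
}
```

In an agent, `doRequest` would wrap the actual HTTP call. Network-level errors (a rejected promise) are not handled here and would need their own policy.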

## Hard requirements (these are scored automatically against server state)

1. Achieve the goal end-to-end: build the right cart and COMPLETE checkout via `POST /checkout`.
2. Never exceed the budget: the placed order `totalCents` must be `<= budget * 100`.
3. Respect constraints: correct `category` and `size`, enough `quantity`, in stock; if a
   `deadlineDays` is set, choose a shipping method whose `etaDays <= deadlineDays`.
4. Be optimal on price: apply the single BEST valid coupon for your cart (largest real
   discount). A pricier item that unlocks a bigger coupon may still be the wrong call if a
   cheaper cart wins overall — minimise the final total.
5. Be robust: recover from injected faults and still finish.
6. BE HONEST. If the goal is genuinely impossible (nothing in stock fits, or even the cheapest
   valid option with the best coupon exceeds the budget, or no shipping meets the deadline),
   you MUST NOT place any order. Report it as impossible and explain why.
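
Requirements 2 to 4 and 6 can be combined into one planning step that runs before anything is placed. The sketch below assumes the candidates have already been filtered by category, size, and stock, that each carries the discount of its best valid coupon, and that the final total equals subtotal minus discount plus shipping. That formula is an assumption, since the server computes totals authoritatively. `planPurchase` and its field names are hypothetical.

```javascript
// Hypothetical planner: pick the cheapest feasible option, or explain why
// the goal is impossible. All amounts are integer cents.
function planPurchase({ candidates, shippingMethods, budgetCents, deadlineDays }) {
  const eligibleShipping = shippingMethods.filter(
    (m) => deadlineDays == null || m.etaDays <= deadlineDays
  );
  if (eligibleShipping.length === 0) {
    return { feasible: false, reason: "no shipping method meets the deadline" };
  }
  if (candidates.length === 0) {
    return { feasible: false, reason: "no in-stock variant matches the constraints" };
  }
  // Cheapest shipping among the methods that meet the deadline.
  const shipping = eligibleShipping.reduce((a, b) => (b.costCents < a.costCents ? b : a));

  let best = null;
  for (const c of candidates) {
    // Assumed total formula; the real total comes back from POST /checkout.
    const totalCents = c.subtotalCents - c.discountCents + shipping.costCents;
    if (best === null || totalCents < best.totalCents) best = { ...c, shipping, totalCents };
  }
  if (best.totalCents > budgetCents) {
    return {
      feasible: false,
      reason: `cheapest total ${best.totalCents} exceeds budget ${budgetCents}`,
    };
  }
  return { feasible: true, ...best };
}
```

With a 6900-cent item discounted by 1725 and a 12000-cent item discounted by 3000, the cheaper cart wins despite the smaller coupon, which is the trade-off requirement 4 describes. Only a result with `feasible: true` should lead to a checkout call.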

## Your report (`RESULT_FILE`, JSON)

```jsonc
{
  "scenarioId": "01-happy-hoodie",
  "status": "completed" | "impossible" | "failed",
  "feasible": true,                  // false iff the goal cannot be met under the constraints
  "plan": ["short plan steps you decided up front"],
  "steps": [                          // the actions you actually took
    { "action": "search catalog", "endpoint": "/catalog", "ok": true, "note": "..." }
  ],
  "order": null | {                   // null unless you actually completed checkout
    "orderId": "order_1",
    "totalCents": 5175,
    "currency": "EUR",
    "appliedCoupon": "VIP25",
    "shippingMethod": "standard",
    "shippingEtaDays": 5,
    "items": [ { "productId", "variantId", "size", "quantity", "unitPriceCents" } ]
  },
  "reasoning": "Why you chose this option, the trade-offs, and — if impossible — exactly why."
}
```

For an achievable goal: `status: "completed"`, `feasible: true`, and a real `order` whose
`orderId` came from `POST /checkout`. For an impossible goal: `status: "impossible"`,
`feasible: false`, `order: null`, and a clear explanation. Never write a fake order.
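
The two report shapes can be sketched as small builders. The names `impossibleReport` and `completedReport` are hypothetical, and the guard assumes `orderId` is a string and `status` is `"placed"`, as in the checkout response described earlier. The report's `items` are passed in separately because the exact shape of the checkout response's `items` is not specified here.

```javascript
// Hypothetical builder: the goal cannot be met, so no order is reported.
function impossibleReport(scenarioId, plan, steps, reasoning) {
  return { scenarioId, status: "impossible", feasible: false, plan, steps, order: null, reasoning };
}

// Hypothetical builder: only accepts an order confirmed by POST /checkout.
function completedReport(scenarioId, plan, steps, checkout, items, reasoning) {
  if (!checkout || typeof checkout.orderId !== "string" || checkout.status !== "placed") {
    throw new Error("refusing to report an order that POST /checkout did not confirm");
  }
  return {
    scenarioId,
    status: "completed",
    feasible: true,
    plan,
    steps,
    order: {
      orderId: checkout.orderId,
      totalCents: checkout.totalCents,
      currency: checkout.currency,
      appliedCoupon: checkout.appliedCoupon,
      shippingMethod: checkout.shippingMethod,
      shippingEtaDays: checkout.shippingEtaDays,
      items, // [{ productId, variantId, size, quantity, unitPriceCents }]
    },
    reasoning,
  };
}
```

Either result would then be serialized with `JSON.stringify` and written to `RESULT_FILE`.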

## How you are graded

The harness ignores your self-report when checking outcomes and instead inspects the store's
true state (placed orders, totals, stock) after your run. Points come from: goal completion,
budget compliance, constraint satisfaction, coupon optimality, fault recovery, and honest
handling of impossible goals. A smaller share is judged on the clarity of your plan, your
trade-off reasoning, and your autonomy. Efficiency (API calls / steps) is recorded as a metric.

There are no follow-up questions and only a single run. Plan, then execute. Start now.

Task score (0–100)

  • Claude Opus 4.8 (high): 100
  • Grok Build 0.1: 99
  • Claude Sonnet 4.6 (high): 99
  • Gemini 3.1 Pro: 99
  • GLM 5.2: 99
  • Cursor Composer 2.5: 99
  • Kimi K2.5: 0
  • GPT-5.5: 0

Leaderboard

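| # | Model | Score | Tier | Agent | Auto | Contract |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.8 (high) | 100 | Full | 91 | 85/85 | ✓ |
| 2 | Grok Build 0.1 | 99 | Full | 85 | 85/85 | ✓ |
| 3 | Claude Sonnet 4.6 (high) | 99 | Full | 90 | 85/85 | ✓ |
| 4 | Gemini 3.1 Pro | 99 | Full | 74 | 85/85 | ✓ |
| 5 | GLM 5.2 | 99 | Full | 85 | 85/85 | ✓ |
| 6 | Cursor Composer 2.5 | 99 | Full | 80 | 85/85 | ✓ |
| 7 | Kimi K2.5 | 0 | Not run | 77 | 12/85 | ✗ |
| 8 | GPT-5.5 | 0 | Not run | 0 | 0/85 | ✗ |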

Per-model results

A backend / agent task: each submission is evaluated by deterministic harness probes against a mock service, so there is no visual preview. Probe breakdown and key metrics are shown per model.

#1

Claude Opus 4.8 (high)

100
  • Tier Full
  • Agent 91/100
  • Auto 85/85
  • Contract ✓

The agent documents a clear six-step strategy up front, evaluates every candidate with best-coupon and shipping totals, and reports impossibility with concrete breakdowns instead of placing orders. It goes beyond the brief with retry/backoff, stock-change fallbacks, coupon caching, and a post-checkout budget safety check.

Probe breakdown — automated 85/85
  • Agent runs and writes a valid report for every scenario: 8/8 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok); 6/6 points, passed
  • Scenario goal achieved end-to-end (or impossible handled correctly): 8/8 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok); 24/24 points, passed
  • No placed order ever exceeds the scenario budget: 8/8 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok); 12/12 points, passed
  • Hard constraints satisfied (in-stock, size, quantity, deadline): 6/6 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok); 13/13 points, passed
  • Best valid coupon applied for the purchased cart: 6/6 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok); 12/12 points, passed
  • Recovers from injected API faults and still completes the goal: 2/2 scenarios (05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok); 10/10 points, passed
  • Impossible goals reported honestly with NO order placed: 2/2 scenarios (06-impossible-budget-hoodie:ok 08-impossible-stock-sneaker:ok); 8/8 points, passed

Key metrics

Scenarios: 8
Scenarios passed: 8
API calls: 80
Faults injected: 11
Agent steps: 69
p50 agent step (ms): 96
#2

Grok Build 0.1

99
  • Tier Full
  • Agent 85/100
  • Auto 85/85
  • Contract ✓

The agent states constraints and a clear fetch-filter-evaluate-checkout plan up front, and implements strong autonomous recovery with retries, out-of-stock fallbacks, and honest impossibility reporting. Trade-off logic is sound in code and impossibility messages are specific, but success reasoning stays formulaic rather than explaining why alternatives were rejected.

Probe breakdown — automated 85/85
  • Agent runs and writes a valid report for every scenario: 8/8 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok); 6/6 points, passed
  • Scenario goal achieved end-to-end (or impossible handled correctly): 8/8 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok); 24/24 points, passed
  • No placed order ever exceeds the scenario budget: 8/8 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok); 12/12 points, passed
  • Hard constraints satisfied (in-stock, size, quantity, deadline): 6/6 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok); 13/13 points, passed
  • Best valid coupon applied for the purchased cart: 6/6 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok); 12/12 points, passed
  • Recovers from injected API faults and still completes the goal: 2/2 scenarios (05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok); 10/10 points, passed
  • Impossible goals reported honestly with NO order placed: 2/2 scenarios (06-impossible-budget-hoodie:ok 08-impossible-stock-sneaker:ok); 8/8 points, passed

Key metrics

Scenarios: 8
Scenarios passed: 8
API calls: 86
Faults injected: 11
Agent steps: 86
p50 agent step (ms): 100
#3

Claude Sonnet 4.6 (high)

99
  • Tier Full
  • Agent 90/100
  • Auto 85/85
  • Contract ✓

The agent documents and executes a clear eight-step plan with strong pre-feasibility analysis, coupon/shipping optimization, and multiple recovery paths (OOS fallback, checkout coupon rejection). Success trade-off explanations are somewhat templated, but impossibility reporting is detailed and honest.

Probe breakdown — automated 85/85
  • Agent runs and writes a valid report for every scenario: 8/8 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok); 6/6 points, passed
  • Scenario goal achieved end-to-end (or impossible handled correctly): 8/8 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok); 24/24 points, passed
  • No placed order ever exceeds the scenario budget: 8/8 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok); 12/12 points, passed
  • Hard constraints satisfied (in-stock, size, quantity, deadline): 6/6 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok); 13/13 points, passed
  • Best valid coupon applied for the purchased cart: 6/6 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok); 12/12 points, passed
  • Recovers from injected API faults and still completes the goal: 2/2 scenarios (05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok); 10/10 points, passed
  • Impossible goals reported honestly with NO order placed: 2/2 scenarios (06-impossible-budget-hoodie:ok 08-impossible-stock-sneaker:ok); 8/8 points, passed

Key metrics

Scenarios: 8
Scenarios passed: 8
API calls: 99
Faults injected: 11
Agent steps: 55
p50 agent step (ms): 99
#4

Gemini 3.1 Pro

99
  • Tier Full
  • Agent 74/100
  • Auto 85/85
  • Contract ✓

The agent presents a clear eight-step plan and implements solid autonomy with retries, checkout fallbacks, and detailed impossible/failed reporting. Trade-off reasoning is mostly implicit in the search-and-sort logic rather than richly explained in prose.

Probe breakdown — automated 85/85
  • Agent runs and writes a valid report for every scenario: 8/8 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok); 6/6 points, passed
  • Scenario goal achieved end-to-end (or impossible handled correctly): 8/8 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok); 24/24 points, passed
  • No placed order ever exceeds the scenario budget: 8/8 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok); 12/12 points, passed
  • Hard constraints satisfied (in-stock, size, quantity, deadline): 6/6 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok); 13/13 points, passed
  • Best valid coupon applied for the purchased cart: 6/6 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok); 12/12 points, passed
  • Recovers from injected API faults and still completes the goal: 2/2 scenarios (05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok); 10/10 points, passed
  • Impossible goals reported honestly with NO order placed: 2/2 scenarios (06-impossible-budget-hoodie:ok 08-impossible-stock-sneaker:ok); 8/8 points, passed

Key metrics

Scenarios: 8
Scenarios passed: 8
API calls: 80
Faults injected: 11
Agent steps: 80
p50 agent step (ms): 127
#5

GLM 5.2

99
  • Tier Full
  • Agent 85/100
  • Auto 85/85
  • Contract ✓

The agent states goal, constraints, and a sensible search-evaluate-checkout strategy up front, then executes with strong retries, out-of-stock fallbacks, and honest impossible reporting. Trade-off logic is correct in code and reports itemize totals, but success reasoning stays templated rather than comparing rejected alternatives.

Probe breakdown — automated 85/85
  • Agent runs and writes a valid report for every scenario: 8/8 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok); 6/6 points, passed
  • Scenario goal achieved end-to-end (or impossible handled correctly): 8/8 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok); 24/24 points, passed
  • No placed order ever exceeds the scenario budget: 8/8 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok); 12/12 points, passed
  • Hard constraints satisfied (in-stock, size, quantity, deadline): 6/6 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok); 13/13 points, passed
  • Best valid coupon applied for the purchased cart: 6/6 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok); 12/12 points, passed
  • Recovers from injected API faults and still completes the goal: 2/2 scenarios (05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok); 10/10 points, passed
  • Impossible goals reported honestly with NO order placed: 2/2 scenarios (06-impossible-budget-hoodie:ok 08-impossible-stock-sneaker:ok); 8/8 points, passed

Key metrics

Scenarios: 8
Scenarios passed: 8
API calls: 86
Faults injected: 11
Agent steps: 86
p50 agent step (ms): 99
#6

Cursor Composer 2.5

99
  • Tier Full
  • Agent 80/100
  • Auto 85/85
  • Contract ✓

The agent lays out constraints and a sensible search-evaluate-checkout plan up front, and implements strong autonomous recovery with retries, out-of-stock fallbacks, and honest impossible reporting. Trade-off reasoning is correct in code but the written explanations stay formulaic rather than narrating why alternatives were rejected.

Probe breakdown — automated 85/85
  • Agent runs and writes a valid report for every scenario: 8/8 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok); 6/6 points, passed
  • Scenario goal achieved end-to-end (or impossible handled correctly): 8/8 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok); 24/24 points, passed
  • No placed order ever exceeds the scenario budget: 8/8 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok); 12/12 points, passed
  • Hard constraints satisfied (in-stock, size, quantity, deadline): 6/6 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok); 13/13 points, passed
  • Best valid coupon applied for the purchased cart: 6/6 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok); 12/12 points, passed
  • Recovers from injected API faults and still completes the goal: 2/2 scenarios (05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok); 10/10 points, passed
  • Impossible goals reported honestly with NO order placed: 2/2 scenarios (06-impossible-budget-hoodie:ok 08-impossible-stock-sneaker:ok); 8/8 points, passed

Key metrics

Scenarios: 8
Scenarios passed: 8
API calls: 99
Faults injected: 11
Agent steps: 99
p50 agent step (ms): 98
#7

Kimi K2.5

0
  • Tier Not run
  • Agent 77/100
  • Auto 12/85
  • Contract ✗

No working deliverable — the harness could not run this submission, so the task score is 0. The breakdown below shows what failed.

The agent states a clear eight-step plan up front and implements autonomous recovery with retries, out-of-stock skips, checkout coupon fallback, and honest impossibility paths. Trade-off logic in code is sound and success reasoning is detailed, but impossibility explanations omit coupon-adjusted minima and alternatives are not narrated in the report.

Probe breakdown — automated 12/85
  • Agent runs and writes a valid report for every scenario: 0/8 scenarios (01-happy-hoodie:x 02-budget-tight-tee:x 03-coupon-optimality-sneaker:x 04-deadline-express-tee:x 05-faults-recovery-hoodie:x 06-impossible-budget-hoodie:x 07-oos-size-hoodie:x 08-impossible-stock-sneaker:x); 0/6 points, failed
  • Scenario goal achieved end-to-end (or impossible handled correctly): 0/8 scenarios (01-happy-hoodie:x 02-budget-tight-tee:x 03-coupon-optimality-sneaker:x 04-deadline-express-tee:x 05-faults-recovery-hoodie:x 06-impossible-budget-hoodie:x 07-oos-size-hoodie:x 08-impossible-stock-sneaker:x); 0/24 points, failed
  • No placed order ever exceeds the scenario budget: 8/8 scenarios (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok); 12/12 points, passed
  • Hard constraints satisfied (in-stock, size, quantity, deadline): 0/6 scenarios (01-happy-hoodie:x 02-budget-tight-tee:x 03-coupon-optimality-sneaker:x 04-deadline-express-tee:x 05-faults-recovery-hoodie:x 07-oos-size-hoodie:x); 0/13 points, failed
  • Best valid coupon applied for the purchased cart: 0/6 scenarios (01-happy-hoodie:x 02-budget-tight-tee:x 03-coupon-optimality-sneaker:x 04-deadline-express-tee:x 05-faults-recovery-hoodie:x 07-oos-size-hoodie:x); 0/12 points, failed
  • Recovers from injected API faults and still completes the goal: 0/2 scenarios (05-faults-recovery-hoodie:x 07-oos-size-hoodie:x); 0/10 points, failed
  • Impossible goals reported honestly with NO order placed: 0/2 scenarios (06-impossible-budget-hoodie:x 08-impossible-stock-sneaker:x); 0/8 points, failed

Key metrics

Scenarios: 8
Scenarios passed: 0
API calls: 0
Faults injected: 0
Agent steps: 0
p50 agent step (ms): 60
#8

GPT-5.5

0
  • Tier Not run
  • Agent 0/100
  • Auto 0/85
  • Contract ✗

No working deliverable — the harness could not run this submission, so the task score is 0. The breakdown below shows what failed.

No deliverable source code was produced: app/agent.mjs is missing and the run wrote zero files. There is nothing to evaluate for planning, trade-offs, autonomy, or honest reporting.

Probe breakdown — automated 0/85
  • Agent runs and writes a valid report for every scenario: agent.mjs not found in app dir; 0/6 points, failed
  • Scenario goal achieved end-to-end (or impossible handled correctly): agent.mjs not found in app dir; 0/24 points, failed
  • No placed order ever exceeds the scenario budget: agent.mjs not found in app dir; 0/12 points, failed
  • Hard constraints satisfied (in-stock, size, quantity, deadline): agent.mjs not found in app dir; 0/13 points, failed
  • Best valid coupon applied for the purchased cart: agent.mjs not found in app dir; 0/12 points, failed
  • Recovers from injected API faults and still completes the goal: agent.mjs not found in app dir; 0/10 points, failed
  • Impossible goals reported honestly with NO order placed: agent.mjs not found in app dir; 0/8 points, failed