Task C · Integration Engineering · version v1 · weight 20 · service

Microsoft Dynamics 365 Order Integration

Frozen task prompt
You are building a production-grade order-integration service. You have ONE attempt — a single
agentic run, no follow-up questions, no second chances. You will be judged on whether real
orders actually land correctly in Dynamics 365, on idempotency, on resilience to a hostile API,
on input validation, and on credential hygiene. Correctness is verified by inspecting what your
service actually wrote into the system — not by reading your code's intentions. Stubbing,
faking, or asserting success without really integrating scores zero.

THE JOB
Build a runnable Node service, in the current working directory, that accepts shop orders and
transfers them — with ALL relevant data — into Microsoft Dynamics 365 via the provided mock
Dynamics 365 Web API. "All relevant data" means: the customer, the billing AND shipping
addresses, every line item (sku, name, quantity, unit price, line tax, line total), the order
totals (net, tax, gross, shipping, discount), the currency, the payment method and the shipping
method — mapped exactly per the provided gold mapping.

OUTPUT CONTRACT (mandatory — the harness enforces every line of this)
- Start command is exactly `node server.mjs`. No build step, no install step.
- Node standard library ONLY. You may NOT use any npm package — there is no `npm install` at
  evaluation time and `node_modules` will not exist. Use the built-in `node:http`/`fetch`, etc.
- Read ALL connection config from environment variables:
  - `PORT` — the port your HTTP server must listen on.
  - `D365_BASE_URL` — base URL of the Dynamics 365 API (e.g. `http://127.0.0.1:53110`).
  - `D365_TOKEN` — the bearer access token. Send it as `Authorization: Bearer <D365_TOKEN>`
    on every Dynamics call.
  - NEVER hardcode the base URL or token, and NEVER print the token (or any secret) to
    stdout/stderr or write it into a file. The source tree and the logs are scanned for it.
- Expose `GET /health` returning HTTP 200 (any small JSON body) so readiness can be detected.
- Expose `POST /sync-order`:
  - Request body is a single shop order as JSON (schema below).
  - On success: respond `200` or `201` with a JSON body that includes at least
    `{ "status": "...", "salesorderId": "...", "idempotent": <bool> }`. `idempotent` is `true`
    when this exact order had already been synced (see IDEMPOTENCY).
  - On a malformed/invalid order: respond with an HTTP `4xx` status and a STRUCTURED JSON error
    body, e.g. `{ "error": { "code": "...", "message": "...", "fields": ["..."] } }`. A rejected
    order must write NOTHING to Dynamics — no partial customer, address, order or lines.

THE SHOP ORDER (input) — a realistic Shopware-ish shape
```
{
  "idempotencyKey": "ord_10001",            // stable per logical order; your dedupe key
  "orderNumber": "10001",
  "orderDate": "2026-06-01T10:00:00.000Z",
  "currency": "EUR",                          // ISO 4217; map UPPERCASE
  "currencyFactor": 1.0,                       // exchange rate to base currency
  "paymentMethod": "invoice",                 // code; map via gold valueMap
  "shippingMethod": "dhl_standard",            // code; map via gold valueMap
  "customer": {
    "customerNumber": "C-1001",
    "email": "jane.doe@example.com",
    "firstName": "Jane", "lastName": "Doe",
    "company": "Doe Logistics GmbH",          // present => business customer (D365 account)
    "vatId": "DE123456789",                    // optional
    "taxExempt": false
  },
  "billingAddress": {
    "firstName": "Jane", "lastName": "Doe", "company": "Doe Logistics GmbH",
    "street": "Hauptstrasse 1", "additionalLine": "Building C",
    "zipcode": "10115", "city": "Berlin", "state": "Berlin",
    "countryCode": "DE", "phone": "+49 30 1234567"
  },
  "shippingAddress": { ...same shape as billingAddress... },
  "lineItems": [
    { "sku": "SKU-1", "name": "Product One", "quantity": 2,
      "unitPrice": 19.99, "taxRate": 19, "taxAmount": 7.60, "lineTotal": 39.98 }
  ],
  "totals": { "net": 39.98, "tax": 7.60, "gross": 47.58, "shipping": 4.90, "discount": 0 }
}
```

WHAT TO BUILD INTO DYNAMICS (the entity graph)
For every accepted order, produce this graph in Dynamics, mapping fields EXACTLY as defined in
`fixtures/mapping.gold.json` (read it — it is the unambiguous source of truth, including the
value maps for payment/shipping methods and which fields are rounded/uppercased):
1. A **contact** (the person) — upserted by `emailaddress1`.
2. An **account** (the company) — upserted by `accountnumber` — ONLY when `customer.company`
   is non-empty. The order's customer is the account when a company exists, otherwise the contact.
3. A **Bill To** and a **Ship To** customer address — upserted per customer by address type.
4. A **salesorder** — created idempotently (see below) and linked to the customer and to both
   addresses.
5. One **salesorderdetail** line per order line item — linked to the salesorder.

THE DYNAMICS 365 MOCK API (provided in `fixtures/mock-d365`)
It is a dependency-free Node server speaking a simplified Dataverse Web API. Endpoints
(all data endpoints require `Authorization: Bearer <D365_TOKEN>`; base path `/api/data/v9.2`):
- Lookup (returns `{ "value": [ ... ] }`):
  - `GET /api/data/v9.2/contacts?emailaddress1=<email>`
  - `GET /api/data/v9.2/accounts?accountnumber=<number>`
  - `GET /api/data/v9.2/customeraddresses?_parentid_value=<id>&addresstypecode=<1|2>`
- Create (returns `201` with the created record INCLUDING its id field):
  - `POST /api/data/v9.2/contacts`            -> body fields, returns `{ ..., "contactid": "..." }`
  - `POST /api/data/v9.2/accounts`            -> returns `{ ..., "accountid": "..." }`
  - `POST /api/data/v9.2/customeraddresses`   -> returns `{ ..., "customeraddressid": "..." }`
  - `POST /api/data/v9.2/salesorders`         -> returns `{ ..., "salesorderid": "...",
                                                  "idempotentReplay": <bool> }`
  - `POST /api/data/v9.2/salesorderdetails`   -> returns `{ ..., "salesorderdetailid": "..." }`
- Update (optional, for upsert): `PATCH /api/data/v9.2/<set>(<id>)`.
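
As an illustration of how the lookup and create endpoints listed above could combine into an upsert, here is a minimal sketch. The `d365` object, `lookupOrCreate`, and `upsertContact` are hypothetical names, not part of the fixtures; the sketch assumes `d365.get` and `d365.post` resolve to the parsed JSON body, and that the mock decodes standard URL query encoding.

```javascript
// Sketch only. `d365` is a hypothetical transport object exposing
// get(path) and post(path, body), both resolving to parsed JSON.
const API = "/api/data/v9.2";

// Look a record up by its natural key; create it only when the lookup is empty.
async function lookupOrCreate(d365, { set, idField, query, body }) {
  const qs = new URLSearchParams(query).toString();
  const found = await d365.get(`${API}/${set}?${qs}`);
  if (Array.isArray(found.value) && found.value.length > 0) {
    return { id: found.value[0][idField], created: false };
  }
  const created = await d365.post(`${API}/${set}`, body);
  return { id: created[idField], created: true };
}

// Contacts are keyed by emailaddress1, as the entity graph section states.
function upsertContact(d365, contactBody) {
  return lookupOrCreate(d365, {
    set: "contacts",
    idField: "contactid",
    query: { emailaddress1: contactBody.emailaddress1 },
    body: contactBody,
  });
}
```

The same helper would presumably serve accounts (`accountnumber`) and customer addresses (`_parentid_value` plus `addresstypecode`). Updating an existing record via `PATCH` is left out because the prompt marks it optional.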

IDEMPOTENCY (hard requirement)
- When you create the salesorder, send the order's `idempotencyKey` in the
  `Idempotency-Key` request header. The mock dedupes salesorder creation on this key: a repeated
  key returns the EXISTING salesorder with `idempotentReplay: true` instead of creating a new one.
- Re-submitting the same order (same `idempotencyKey`) to your `/sync-order` MUST result in
  exactly ONE salesorder and NO duplicated line items. If the salesorder POST comes back as an
  idempotent replay, do not create its line items again.
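
The replay rule above can be sketched as follows. This is an illustration under assumptions: `d365.post(path, body, headers)` is a hypothetical transport method resolving to parsed JSON, and `buildLines(salesorderId)` is a hypothetical mapping-layer callback that returns line bodies already linked to the order (the actual link field comes from `fixtures/mapping.gold.json` and is not guessed here).

```javascript
// Sketch only: d365.post and buildLines are hypothetical, see the lead-in.
async function createOrderOnce(d365, order, salesorderBody, buildLines) {
  // The mock dedupes salesorder creation on this header.
  const so = await d365.post("/api/data/v9.2/salesorders", salesorderBody, {
    "Idempotency-Key": order.idempotencyKey,
  });
  const replay = so.idempotentReplay === true;

  // On a replay the order already exists, so its lines must not be posted again.
  if (!replay) {
    for (const line of buildLines(so.salesorderid)) {
      await d365.post("/api/data/v9.2/salesorderdetails", line);
    }
  }
  return {
    status: replay ? "already_synced" : "created",
    salesorderId: so.salesorderid,
    idempotent: replay,
  };
}
```

The returned object matches the success shape required by the output contract; the `status` strings are placeholders. The sketch does not cover a crash between the salesorder POST and the last line POST, which a real service would need to handle separately.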

RESILIENCE (the mock fights back)
- The mock injects transient faults: HTTP `429` (with a `Retry-After` header) for rate limits,
  HTTP `500`, dropped/reset connections, and slow `504` timeouts — on the first N attempts of
  specific endpoints. You MUST retry with backoff (respect `Retry-After`), survive these, and
  still end up with the correct, non-duplicated entity graph. Do not hammer the API; back off.
- Never let a transient fault corrupt state (no half-written orders, no orphaned duplicates).
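
One way to meet the retry requirement above is exponential backoff with jitter that defers to `Retry-After` when the server sends it. The sketch below is illustrative: `backoffDelayMs`, `requestWithRetry`, the retryable status set, and the default timings are assumptions, and `Retry-After` is only handled in its delta-seconds form.

```javascript
// Sketch only: names, status set and timings are illustrative assumptions.
const RETRYABLE_STATUS = new Set([429, 500, 502, 503, 504]);

// Delay before the next attempt: use Retry-After (seconds) when present,
// otherwise exponential backoff with full jitter, capped at maxMs.
function backoffDelayMs(attempt, retryAfter, { baseMs = 100, maxMs = 5000, random = Math.random } = {}) {
  const seconds = Number(retryAfter);
  if (retryAfter != null && retryAfter !== "" && Number.isFinite(seconds) && seconds >= 0) {
    return seconds * 1000;
  }
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// `send` performs one HTTP attempt and resolves to a fetch-like response
// (status plus headers.get). Network errors such as resets reject and are retried.
async function requestWithRetry(send, options = {}) {
  const {
    maxAttempts = 5,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
    ...delayOptions
  } = options;
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    let retryAfter = null;
    try {
      const res = await send();
      if (!RETRYABLE_STATUS.has(res.status)) return res;
      retryAfter = res.headers.get("retry-after");
      lastError = new Error(`D365 responded ${res.status}`);
    } catch (err) {
      lastError = err;
    }
    if (attempt < maxAttempts - 1) {
      await sleep(backoffDelayMs(attempt, retryAfter, delayOptions));
    }
  }
  throw lastError;
}
```

Retrying a create after a dropped connection is only safe when the create itself cannot duplicate, which is why this kind of loop is usually paired with lookup-before-create for customers and addresses and with the `Idempotency-Key` header for the salesorder.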

VALIDATION (no partial writes)
- Validate every order before touching Dynamics: required fields present, valid email, non-empty
  line items, non-negative quantities/prices, totals internally consistent, known currency.
- On any violation, return a `4xx` with a structured error and write NOTHING to Dynamics.
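
A validation pass along these lines could run before any Dynamics call. The sketch is deliberately partial and rests on assumptions: the currency set is an illustrative subset, the email pattern is a loose heuristic, the error code `INVALID_ORDER` is made up, and "internally consistent" is read as `gross = net + tax`, which is what the sample order satisfies.

```javascript
// Sketch only: currency list, email regex, error code and the totals rule are assumptions.
const KNOWN_CURRENCIES = new Set(["EUR", "USD", "GBP", "CHF"]);
const EMAIL_RE = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
const near = (a, b) => Math.abs(a - b) < 0.005; // tolerate float rounding below half a cent

// Returns null for a valid order, otherwise a structured error body listing every bad field.
function validateOrder(order) {
  const fields = [];
  const need = (ok, path) => { if (!ok) fields.push(path); };
  const o = order ?? {};
  const isObject = (v) => v !== null && typeof v === "object";

  need(typeof o.idempotencyKey === "string" && o.idempotencyKey !== "", "idempotencyKey");
  need(typeof o.orderNumber === "string" && o.orderNumber !== "", "orderNumber");
  need(KNOWN_CURRENCIES.has(String(o.currency ?? "").toUpperCase()), "currency");
  need(EMAIL_RE.test(o.customer?.email ?? ""), "customer.email");
  need(isObject(o.billingAddress), "billingAddress");
  need(isObject(o.shippingAddress), "shippingAddress");

  const items = Array.isArray(o.lineItems) ? o.lineItems : [];
  need(items.length > 0, "lineItems");
  items.forEach((item, i) => {
    need(Number.isFinite(item?.quantity) && item.quantity >= 0, `lineItems[${i}].quantity`);
    need(Number.isFinite(item?.unitPrice) && item.unitPrice >= 0, `lineItems[${i}].unitPrice`);
  });

  const t = o.totals ?? {};
  const numeric = [t.net, t.tax, t.gross].every(Number.isFinite);
  need(numeric && near(t.net + t.tax, t.gross), "totals.gross");

  if (fields.length === 0) return null;
  return { error: { code: "INVALID_ORDER", message: `${fields.length} invalid field(s)`, fields } };
}
```

Collecting every offending path before returning, rather than failing on the first, keeps the `4xx` body useful to the caller, and returning before the first Dynamics request is what guarantees no partial writes.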

ENGINEERING BAR
- Cleanly separate the mapping layer (shop order -> D365 entities) from the transport layer
  (HTTP client with retry/backoff/idempotency). Both are read and judged for maintainability.
- Helpful, structured error reporting and sane logging (without leaking secrets).
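
For the logging and credential requirements, one possible shape is to load the three environment variables once and route every log line through a redacting writer. `loadConfig` and `makeLogger` are hypothetical names; the sketch assumes the token contains no characters that JSON escapes, since redaction runs on the serialized line.

```javascript
// Sketch only: loadConfig and makeLogger are illustrative names.
function loadConfig(env = process.env) {
  const missing = ["PORT", "D365_BASE_URL", "D365_TOKEN"].filter((key) => !env[key]);
  // Report which variables are missing, never their values.
  if (missing.length > 0) throw new Error(`missing env: ${missing.join(", ")}`);
  return {
    port: Number(env.PORT),
    baseUrl: env.D365_BASE_URL.replace(/\/+$/, ""), // drop trailing slashes
    token: env.D365_TOKEN,
  };
}

// Returns log(level, event, details) that writes one JSON line with secrets masked.
function makeLogger(secrets, write = (line) => process.stdout.write(line + "\n")) {
  const redact = (text) =>
    secrets.reduce((out, secret) => (secret ? out.split(secret).join("[REDACTED]") : out), text);
  return (level, event, details = {}) => {
    const line = JSON.stringify({ ts: new Date().toISOString(), level, event, ...details });
    write(redact(line));
  };
}
```

A service might call `const cfg = loadConfig(); const log = makeLogger([cfg.token]);` at startup, so that even an accidentally logged `Authorization` header is masked before it reaches stdout.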

You read the fixtures (`fixtures/orders/*.json`, `fixtures/mapping.gold.json`, and the mock under
`fixtures/mock-d365`) to learn the exact shapes. There are no follow-up questions and only a
single run. Start implementing immediately.

Task score (0–100)

  • Claude Opus 4.8 (high): 99
  • Claude Sonnet 4.6 (high): 97
  • GLM 5.2: 96
  • Cursor Composer 2.5: 96
  • Grok Build 0.1: 95
  • GPT-5.5: 94
  • Gemini 3.1 Pro: 94
  • Kimi K2.5: 66.2

Leaderboard

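| # | Model | Score | Tier | Agent | Auto | Contract |
|---|-------|-------|------|-------|------|----------|
| 1 | Claude Opus 4.8 (high) | 99 | Full | 88 | 80/80 | ✓ |
| 2 | Claude Sonnet 4.6 (high) | 97 | Full | 78 | 80/80 | ✓ |
| 3 | GLM 5.2 | 96 | Full | 85 | 80/80 | ✓ |
| 4 | Cursor Composer 2.5 | 96 | Full | 80 | 80/80 | ✓ |
| 5 | Grok Build 0.1 | 95 | Full | 80 | 80/80 | ✓ |
| 6 | GPT-5.5 | 94 | Full | 76 | 80/80 | ✓ |
| 7 | Gemini 3.1 Pro | 94 | Full | 74 | 80/80 | ✓ |
| 8 | Kimi K2.5 | 66.2 | Mid | 68 | 53.2/80 | ✓ |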

Per-model results

A backend / agent task: each submission is evaluated by deterministic harness probes against a mock service, so there is no visual preview. Probe breakdown and key metrics are shown per model.

#1

Claude Opus 4.8 (high)

99
  • Tier Full
  • Agent 88/100
  • Auto 80/80
  • Contract ✓

The deliverable implements a declarative, side-effect-free mapping DSL with complete entity coverage and a well-isolated transport client featuring retry, jitter, Retry-After handling, and idempotency keys. Validation returns field-level paths with specific reasons, and structured JSON logging redacts secrets while recording sync outcomes and retries.

Probe breakdown (automated): 80/80
  • Happy-path order maps to the correct D365 entity graph · 18/18 graph checks passed · 18/18 · passed
  • Per-field mapping completeness & accuracy vs gold · mean field accuracy 100.0% over 4 orders · 20/20 · passed
  • Idempotent on retry (one key => exactly one sales order) · salesorders=1 (want 1), lines=2 (want 2), replayFlag=true · 12/12 · passed
  • Recovers from injected faults (429/500/reset) with retry + backoff · graph 100%, faultsServed=4, soPosts=3, accPosts=2 · 12/12 · passed
  • Customer upsert: lookup-or-create without duplicates · reusedNoDup=true, salesorderRefsSeed=true, newCreated=true · 8/8 · passed
  • Structured 4xx on malformed input with no partial writes · 3/3 rejected with structured 4xx, partialWrites=false · 8/8 · passed
  • No credentials hardcoded or logged · leakInSource=false, leakInLogs=false, readsProcessEnv=true · 2/2 · passed

Key metrics
  • API calls: 118
  • Retries: 4
  • p50 latency: 5 ms
#2

Claude Sonnet 4.6 (high)

97
  • Tier Full
  • Agent 78/100
  • Auto 80/80
  • Contract ✓

Clean four-module design with complete entity mappers, a solid retrying D365 client with upsert and idempotency, and rich validation errors with field-level details. Observability is adequate but minimal—failures are logged safely while successful syncs are not.

Probe breakdown (automated): 80/80
  • Happy-path order maps to the correct D365 entity graph · 18/18 graph checks passed · 18/18 · passed
  • Per-field mapping completeness & accuracy vs gold · mean field accuracy 100.0% over 4 orders · 20/20 · passed
  • Idempotent on retry (one key => exactly one sales order) · salesorders=1 (want 1), lines=2 (want 2), replayFlag=true · 12/12 · passed
  • Recovers from injected faults (429/500/reset) with retry + backoff · graph 100%, faultsServed=4, soPosts=3, accPosts=2 · 12/12 · passed
  • Customer upsert: lookup-or-create without duplicates · reusedNoDup=true, salesorderRefsSeed=true, newCreated=true · 8/8 · passed
  • Structured 4xx on malformed input with no partial writes · 3/3 rejected with structured 4xx, partialWrites=false · 8/8 · passed
  • No credentials hardcoded or logged · leakInSource=false, leakInLogs=false, readsProcessEnv=true · 2/2 · passed

Key metrics
  • API calls: 124
  • Retries: 4
  • p50 latency: 4 ms
#3

GLM 5.2

96
  • Tier Full
  • Agent 85/100
  • Auto 80/80
  • Contract ✓

Excellent layer separation with complete, maintainable pure mappers and a resilient generic HTTP client, though mappings are hardcoded rather than loaded from gold JSON. Validation collects all offending field paths and cross-field total checks, but lacks per-field reasons; logging is secret-safe and covers sync outcomes without going deeper.

Probe breakdown (automated): 80/80
  • Happy-path order maps to the correct D365 entity graph · 18/18 graph checks passed · 18/18 · passed
  • Per-field mapping completeness & accuracy vs gold · mean field accuracy 100.0% over 4 orders · 20/20 · passed
  • Idempotent on retry (one key => exactly one sales order) · salesorders=1 (want 1), lines=2 (want 2), replayFlag=true · 12/12 · passed
  • Recovers from injected faults (429/500/reset) with retry + backoff · graph 100%, faultsServed=4, soPosts=3, accPosts=2 · 12/12 · passed
  • Customer upsert: lookup-or-create without duplicates · reusedNoDup=true, salesorderRefsSeed=true, newCreated=true · 8/8 · passed
  • Structured 4xx on malformed input with no partial writes · 3/3 rejected with structured 4xx, partialWrites=false · 8/8 · passed
  • No credentials hardcoded or logged · leakInSource=false, leakInLogs=false, readsProcessEnv=true · 2/2 · passed

Key metrics
  • API calls: 118
  • Retries: 4
  • p50 latency: 3 ms
#4

Cursor Composer 2.5

96
  • Tier Full
  • Agent 80/100
  • Auto 80/80
  • Contract ✓

Clean separation of mapping and transport with complete, maintainable pure mappers and a resilient generic HTTP client, but mapping is hardcoded rather than data-driven from gold JSON. Validation and logging are structured and secret-safe, though field errors list paths without per-field reasons and observability stays minimal.

Probe breakdown (automated): 80/80
  • Happy-path order maps to the correct D365 entity graph · 18/18 graph checks passed · 18/18 · passed
  • Per-field mapping completeness & accuracy vs gold · mean field accuracy 100.0% over 4 orders · 20/20 · passed
  • Idempotent on retry (one key => exactly one sales order) · salesorders=1 (want 1), lines=2 (want 2), replayFlag=true · 12/12 · passed
  • Recovers from injected faults (429/500/reset) with retry + backoff · graph 100%, faultsServed=4, soPosts=3, accPosts=2 · 12/12 · passed
  • Customer upsert: lookup-or-create without duplicates · reusedNoDup=true, salesorderRefsSeed=true, newCreated=true · 8/8 · passed
  • Structured 4xx on malformed input with no partial writes · 3/3 rejected with structured 4xx, partialWrites=false · 8/8 · passed
  • No credentials hardcoded or logged · leakInSource=false, leakInLogs=false, readsProcessEnv=true · 2/2 · passed

Key metrics
  • API calls: 118
  • Retries: 4
  • p50 latency: 3 ms
#5

Grok Build 0.1

95
  • Tier Full
  • Agent 80/100
  • Auto 80/80
  • Contract ✓

Clean mapping/transport separation with complete pure mappers and a resilient generic HTTP client, but mappings are hardcoded rather than loaded from gold JSON. Validation aggregates specific field paths and cross-field total checks before any D365 write, yet omits per-field reasons; logging is secret-safe but barely covers sync outcomes.

Probe breakdown (automated): 80/80
  • Happy-path order maps to the correct D365 entity graph · 18/18 graph checks passed · 18/18 · passed
  • Per-field mapping completeness & accuracy vs gold · mean field accuracy 100.0% over 4 orders · 20/20 · passed
  • Idempotent on retry (one key => exactly one sales order) · salesorders=1 (want 1), lines=2 (want 2), replayFlag=true · 12/12 · passed
  • Recovers from injected faults (429/500/reset) with retry + backoff · graph 100%, faultsServed=4, soPosts=3, accPosts=2 · 12/12 · passed
  • Customer upsert: lookup-or-create without duplicates · reusedNoDup=true, salesorderRefsSeed=true, newCreated=true · 8/8 · passed
  • Structured 4xx on malformed input with no partial writes · 3/3 rejected with structured 4xx, partialWrites=false · 8/8 · passed
  • No credentials hardcoded or logged · leakInSource=false, leakInLogs=false, readsProcessEnv=true · 2/2 · passed

Key metrics
  • API calls: 118
  • Retries: 4
  • p50 latency: 3 ms
#6

GPT-5.5

94
  • Tier Full
  • Agent 76/100
  • Auto 80/80
  • Contract ✓

Mapping is complete and expressed in clear payload builders with value maps, but everything lives in one monolithic file rather than a data-driven, separated layer. Transport is well encapsulated in D365Client with solid retry/backoff and idempotency, while validation is thorough with granular field paths but no per-field reasons and almost no sync-outcome logging beyond a safe startup line.

Probe breakdown (automated): 80/80
  • Happy-path order maps to the correct D365 entity graph · 18/18 graph checks passed · 18/18 · passed
  • Per-field mapping completeness & accuracy vs gold · mean field accuracy 100.0% over 4 orders · 20/20 · passed
  • Idempotent on retry (one key => exactly one sales order) · salesorders=1 (want 1), lines=2 (want 2), replayFlag=true · 12/12 · passed
  • Recovers from injected faults (429/500/reset) with retry + backoff · graph 100%, faultsServed=4, soPosts=3, accPosts=2 · 12/12 · passed
  • Customer upsert: lookup-or-create without duplicates · reusedNoDup=true, salesorderRefsSeed=true, newCreated=true · 8/8 · passed
  • Structured 4xx on malformed input with no partial writes · 3/3 rejected with structured 4xx, partialWrites=false · 8/8 · passed
  • No credentials hardcoded or logged · leakInSource=false, leakInLogs=false, readsProcessEnv=true · 2/2 · passed

Key metrics
  • API calls: 124
  • Retries: 4
  • p50 latency: 5 ms
#7

Gemini 3.1 Pro

94
  • Tier Full
  • Agent 74/100
  • Auto 80/80
  • Contract ✓

The mapping layer is cleanly separated with data-driven value maps and complete entity coverage, and validation goes beyond the brief with totals and line-item consistency checks. Transport retry/backoff works well in D365Client but is undermined by an unused duplicate d365.mjs, and observability stays minimal beyond safe error logging.

Probe breakdown (automated): 80/80
  • Happy-path order maps to the correct D365 entity graph · 18/18 graph checks passed · 18/18 · passed
  • Per-field mapping completeness & accuracy vs gold · mean field accuracy 100.0% over 4 orders · 20/20 · passed
  • Idempotent on retry (one key => exactly one sales order) · salesorders=1 (want 1), lines=2 (want 2), replayFlag=true · 12/12 · passed
  • Recovers from injected faults (429/500/reset) with retry + backoff · graph 100%, faultsServed=4, soPosts=3, accPosts=2 · 12/12 · passed
  • Customer upsert: lookup-or-create without duplicates · reusedNoDup=true, salesorderRefsSeed=true, newCreated=true · 8/8 · passed
  • Structured 4xx on malformed input with no partial writes · 3/3 rejected with structured 4xx, partialWrites=false · 8/8 · passed
  • No credentials hardcoded or logged · leakInSource=false, leakInLogs=false, readsProcessEnv=true · 2/2 · passed

Key metrics
  • API calls: 124
  • Retries: 4
  • p50 latency: 5 ms
#8

Kimi K2.5

66.2
  • Tier Mid
  • Agent 68/100
  • Auto 53.2/80
  • Contract ✓

The mapper is readable and covers all entities, but mappings are hardcoded and duplicate mapping.gold.json instead of loading it, with several spec transforms (e.g. email lowercase) omitted. Transport resilience is strong via a dedicated retry/backoff client with idempotency headers, while validation computes field-level reasons but the API response strips them down to field names only.

Probe breakdown (automated): 53.2/80
  • Happy-path order maps to the correct D365 entity graph · 10/18 graph checks passed · 10/18 · failed
  • Per-field mapping completeness & accuracy vs gold · mean field accuracy 19.4% over 4 orders · 3.89/20 · failed
  • Idempotent on retry (one key => exactly one sales order) · salesorders=1 (want 1), lines=2 (want 2), replayFlag=true · 12/12 · passed
  • Recovers from injected faults (429/500/reset) with retry + backoff · graph 67%, faultsServed=4, soPosts=3, accPosts=2 · 9.33/12 · failed
  • Customer upsert: lookup-or-create without duplicates · reusedNoDup=true, salesorderRefsSeed=true, newCreated=true · 8/8 · passed
  • Structured 4xx on malformed input with no partial writes · 3/3 rejected with structured 4xx, partialWrites=false · 8/8 · passed
  • No credentials hardcoded or logged · leakInSource=false, leakInLogs=false, readsProcessEnv=true · 2/2 · passed

Key metrics
  • API calls: 132
  • Retries: 4
  • p50 latency: 4 ms