Rank #7 · 5/5 tasks scored

Kimi K2.7 Code

Name: Kimi K2.7 Code — Agentic Commerce Model Benchmark capability profile
Item: Kimi K2.7 Code
Rating: 88.8

Aurum Atelier delivers a cohesive vanilla-stack luxury PDP with a genuine raw WebGL2 SDF watch configurator, persistent cart/pricing logic, locale switching, and a concierge that can configure, coupon, and open checkout with undo—yet screenshots show WebGL fallback in several views, checkout lacks a distinct confirmation step, and reviews, wishlist, image zoom, and mobile nav are absent.

Elo 1413
Reliability 100% (5/5 runs)
Efficiency 30.9/100

88.8

Global capability index (weighted mean of scored tasks)

Capability profile

Five task axes (each normalized to its 0–100 task score).

Per-task scores

Task	Score	Base	Excel.	Judge	Robust.	Tier	Elo
Premium Storefront	82	39.2	14.9	17.8	10	High	1395
Client-side WebGPU Product Q&A	81	40.2	11.1	20	10	Full	1204
Microsoft Dynamics 365 Order Integration	93	45	18.4	20	10	Full	1501
In-browser Fashion Fit Estimation	91	45	18.6	17.5	10	Full	1397
Autonomous Buying Agent	96	45	18	23.3	10	Full	1403

Score composition per task: 45 base + 20 excellence + 25 judge + 10 robustness. "n/a" = no data for that component in this run (weights renormalized). Failed runs are reliability events, not 0-scores.

Task A — Premium Storefront

81.9base 39.2/45 + excellence 14.9/20 + judge 17.8/25 + robustness 10/10 · tier high · agent 59/100

Score breakdown

Functional20 / 20
Visual Design14 / 20
UX15 / 20
Engineering17.3 / 20
AI Quality15 / 20

Engineering in detail

Structure / maintainability / readability8 / 10
Performance3.6 / 4
Accessibility2.7 / 3
Error-freeness3 / 3

Lighthouse

89Performance

97Accessibility

96Best Practices

100SEO

CLS 0.043 · LCP 1502ms · Page weight 95 KB · axe 1 (crit 0)

Live preview

The actual storefront generated by the model — interactive.Open in new tab ↗

Screenshots

Kimi K2.7 Code — Desktop · Light — Desktop · Light

Kimi K2.7 Code — Desktop · Dark — Desktop · Dark

Verified interactions

Behavior actually driven in the browser (not just present in the DOM).

Add to cart workspass
Dark mode togglespass
Variant changes price/galleryfail
Search filters the catalogunknown
AI assistant performs an actionfail
Cart persists across reloadpass

Structured data & SEO

✓ Product✓ Offer✓ AggregateRating✓ BreadcrumbList

✓ Meta description
✗ Canonical URL
✗ Open Graph image
100% of images have alt text (1/1)

Tokens & cost

Token usage is reported by the agent run. Cost is an estimate (tokens × configured rates); shows “—” until rates are set.

—Total tokens

—Input tokens

—Output tokens

—Est. cost

—Cost / 100 pts

Runtime metrics

1203.4sRun time

131Tool calls

53.2sTime to first tool

1.5sTime to first render

0Runtime errors

Deductions (−1)

−1 Copy-paste template

Feature matrix (23/25)

3D configurator (WebGL)present
Product gallerypresent
Image zoommissing
Color variantspresent
Size selectionpresent
Live stockpresent
Pricepresent
Discountpresent
Buy boxpresent
Sticky behaviorpresent
Reviewspresent
Cross-sellingpresent
Search / filterpresent
Mobile navigationmissing
Wishlistpresent
Cartpresent
Multi-step checkoutpresent
Currency / locale switchpresent
AI assistantpresent
Dark modepresent
Animationspresent
Accessibility (basics)present

Per-task results

Each task this model also ran, with the same depth as Task A where the task allows it: static-app tasks show screenshots and a live preview; backend / agent tasks show the harness probe breakdown and key metrics.

Client-side WebGPU Product Q&A

81.3

Applied On-device AI · static app

Tier Full
Agent 72/100
Auto 68.3/75
Contract ✓

Base 40.2/45
Excellence 11.1/20
Judge 20/25
Robustness 10/10

Catalog grounding, citations, refusals, and graceful fallback are implemented thoroughly with a polished chat UI and real WebGPU init. The main gaps are simulated (not model-native) streaming, LLM inference using non-streaming completion with metrics that do not reflect true TTFT, and no WASM/CPU tier reporting.

Client-side WebGPU Product Q&A on Client-side WebGPU Product Q&A — Desktop · Light — Desktop · Light

Client-side WebGPU Product Q&A on Client-side WebGPU Product Q&A — Desktop · Dark — Desktop · Dark

Client-side WebGPU Product Q&A on Client-side WebGPU Product Q&A — Mobile — Mobile

Live preview — interactive app

Loads the actual app generated by this model.Open in a new tab ↗

Probe breakdown — automated 68.3/75

Harness hook present & well-formed__ask returns { answer, sources? }6/6passed
In-scope factual answers grounded in catalog5/6 in-scope factual correct13.33/16failed
Multi-fact reasoning answers2/2 multi-fact correct8/8passed
Refuses out-of-scope questions2/2 out-of-scope refused7/7passed
Refuses adversarial / fabrication bait2/2 adversarial refused7/7passed
Answers cite their catalog source8/8 in-scope answers cited a source5/5passed
On-device model initialization tierreported tier=webgpu, engine=web-llm@0.2.84/Qwen2.5-0.5B-Instruct-q4f32_1-MLC, navigator.gpu=true10/10passed
Latency within budget (TTFT + tokens/sec)ttftMs=n/a (budget 30000), tokensPerSec=n/a0/4failed
No off-allowlist traffic after loadno off-allowlist hosts6/6passed
Robust to hostile input (empty / very long / rapid-fire)empty=true longInput=true rapidFire=true6/6passed

Key metrics

1507Model load (ms)

17Network requests

0Off-allowlist requests

14Console errors

12Q&A total

11Q&A passed

Compare all models on this task →

Microsoft Dynamics 365 Order Integration

93.4

Integration Engineering · backend / agent task

Tier Full
Agent 80/100
Auto 80/80
Contract ✓

Base 45/45
Excellence 18.4/20
Judge 20/25
Robustness 10/10

The mapping layer is clear and complete with pure functions and mirrored value maps, but hardcoded rather than driven by mapping.gold.json. Transport cleanly separates retry/backoff and idempotency concerns, validation goes beyond the brief with totals cross-checks yet only lists field paths without per-field reasons, and logging is secret-safe but minimal on successful syncs.

Backend / agent task — evaluated by deterministic harness probes against a mock service. There is no visual preview for this submission.

Probe breakdown — automated 80/80

Happy-path order maps to the correct D365 entity graph18/18 graph checks passed18/18passed
Per-field mapping completeness & accuracy vs goldmean field accuracy 100.0% over 4 orders20/20passed
Idempotent on retry (one key => exactly one sales order)salesorders=1 (want 1), lines=2 (want 2), replayFlag=true12/12passed
Recovers from injected faults (429/500/reset) with retry + backoffgraph 100%, faultsServed=4, soPosts=3, accPosts=212/12passed
Customer upsert: lookup-or-create without duplicatesreusedNoDup=true, salesorderRefsSeed=true, newCreated=true8/8passed
Structured 4xx on malformed input with no partial writes3/3 rejected with structured 4xx, partialWrites=false8/8passed
No credentials hardcoded or loggedleakInSource=false, leakInLogs=false, readsProcessEnv=true2/2passed

Key metrics

124API calls

4Retries

4p50 latency (ms)

Compare all models on this task →

In-browser Fashion Fit Estimation

91.1

On-device ML & Continual Learning · static app

Tier Full
Agent 82/100
Auto 70/70
Contract ✓

Base 45/45
Excellence 18.6/20
Judge 17.5/25
Robustness 10/10

Uncertainty is communicated well through labeled confidence meters and per-measurement ranges, with strong repeated on-device privacy messaging and a chest-driven explanation. Gaps include no explicit low-confidence warnings beyond the meter, return feedback that does not refresh the live recommendation, and a pre-checked consent checkbox.

In-browser Fashion Fit Estimation on In-browser Fashion Fit Estimation — Desktop · Light — Desktop · Light

In-browser Fashion Fit Estimation on In-browser Fashion Fit Estimation — Desktop · Dark — Desktop · Dark

In-browser Fashion Fit Estimation on In-browser Fashion Fit Estimation — Mobile — Mobile

Live preview — interactive app

Loads the actual app generated by this model.Open in a new tab ↗

Probe breakdown — automated 70/70

Loads & produces well-formed metricswell-formed { measurements, size, confidence }8/8passed
Real on-device model + runtime loadedreal model (external=true, backend=wasm)6/6passed
Measurement accuracy vs gold (MAE)meanMAE=1.02cm per-metric={"chest":1.26,"waist":2.17,"hip":1.33,"inseam":0.28,"shoulder":0.08}16/16passed
Recommended size top-1 accuracytop1=8/810/10passed
Recommended size within ±1within1=8/86/6passed
Online learning lowers holdout error (CORE)holdout err 1 -> 014/14passed
Graceful no-person / bad-image / non-imagenoperson:no_person✓ bad:bad_image✓ notimage:non_image✓4/4passed
No image egress / on-device onlyoffAllowlist=0 bigUploadsAfterEstimate=06/6passed

Key metrics

1718Model load (ms)

20Estimate latency (ms)

1.02Mean abs. error (cm)

1.26MAE chest (cm)

2.17MAE waist (cm)

1.33MAE hip (cm)

0.28MAE inseam (cm)

0.08MAE shoulder (cm)

1Size top-1

1Holdout error (before)

0Holdout error (after)

0Off-allowlist requests

Compare all models on this task →

Autonomous Buying Agent

96.3

Agentic Planning & Tool Use · backend / agent task

Tier Full
Agent 85/100
Auto 85/85
Contract ✓

Base 45/45
Excellence 18/20
Judge 23.3/25
Robustness 10/10

The agent publishes a concise, constraint-aware plan that matches its five-stage pipeline and adds a chosen-option summary after evaluation. It optimizes total cost across coupons and shipping, retries transient faults, falls back on stock conflicts, and reports impossibility with concrete proofs, though success reasoning stays formulaic rather than comparing rejected alternatives.

Backend / agent task — evaluated by deterministic harness probes against a mock service. There is no visual preview for this submission.

Probe breakdown — automated 85/85

Agent runs and writes a valid report for every scenario8/8 — 01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok6/6passed
Scenario goal achieved end-to-end (or impossible handled correctly)8/8 — 01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok24/24passed
No placed order ever exceeds the scenario budget8/8 — 01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok12/12passed
Hard constraints satisfied (in-stock, size, quantity, deadline)6/6 — 01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok13/13passed
Best valid coupon applied for the purchased cart6/6 — 01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok12/12passed
Recovers from injected API faults and still completes the goal2/2 — 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok10/10passed
Impossible goals reported honestly with NO order placed2/2 — 06-impossible-budget-hoodie:ok 08-impossible-stock-sneaker:ok8/8passed
Excellence scenario goal achieved end-to-end7/8 — x1-excellence-quantity-coupon:ok x2-excellence-coupon-required:ok x3-excellence-deadline-budget:ok x4-excellence-fault-storm:ok x5-excellence-impossible-deadline:ok x6-excellence-impossible-stock-depth:ok x7-excellence-multi-product-cart:x x8-excellence-zero-slack:ok9/10failed
Excellence: optimal cart + coupon under tight budgets5/6 — x1-excellence-quantity-coupon:ok x2-excellence-coupon-required:ok x3-excellence-deadline-budget:ok x4-excellence-fault-storm:ok x7-excellence-multi-product-cart:x x8-excellence-zero-slack:ok3/4failed
Excellence: survives heavy fault storms2/2 — x4-excellence-fault-storm:ok x8-excellence-zero-slack:ok3/3passed
Excellence: subtle impossibilities handled honestly2/2 — x5-excellence-impossible-deadline:ok x6-excellence-impossible-stock-depth:ok3/3passed

Key metrics

8Scenarios

8Scenarios passed

8Excellence scenarios

7Excellence passed

18Excellence points

20Excellence max

152API calls

25Faults injected

99Agent steps

130p50 agent step (ms)

Compare all models on this task →