← Back to overview

Rank #7 · 5/5 tasks scored

Kimi K2.7 Code

Aurum Atelier delivers a cohesive vanilla-stack luxury PDP with a genuine raw WebGL2 SDF watch configurator, persistent cart/pricing logic, locale switching, and a concierge that can configure, coupon, and open checkout with undo—yet screenshots show WebGL fallback in several views, checkout lacks a distinct confirmation step, and reviews, wishlist, image zoom, and mobile nav are absent.

  • Elo 1413
  • Reliability 100% (5/5 runs)
  • Efficiency 30.9/100
88.8
Global capability index (weighted mean of scored tasks)

Capability profile

Five task axes (each normalized to its 0–100 task score).

Frontend & Commerce CraftApplied On-device AIIntegration EngineeringOn-device ML & Continual LearningAgentic Planning & Tool Use

Per-task scores

TaskScoreBaseExcel.JudgeRobust.TierElo
Premium Storefront8239.214.917.810High1395
Client-side WebGPU Product Q&A8140.211.12010Full1204
Microsoft Dynamics 365 Order Integration934518.42010Full1501
In-browser Fashion Fit Estimation914518.617.510Full1397
Autonomous Buying Agent96451823.310Full1403

Score composition per task: 45 base + 20 excellence + 25 judge + 10 robustness. "n/a" = no data for that component in this run (weights renormalized). Failed runs are reliability events, not 0-scores.

Task A — Premium Storefront

81.9base 39.2/45 + excellence 14.9/20 + judge 17.8/25 + robustness 10/10 · tier high · agent 59/100

Score breakdown

  • Functional20 / 20
  • Visual Design14 / 20
  • UX15 / 20
  • Engineering17.3 / 20
  • AI Quality15 / 20

Engineering in detail

  • Structure / maintainability / readability8 / 10
  • Performance3.6 / 4
  • Accessibility2.7 / 3
  • Error-freeness3 / 3

Lighthouse

89Performance
97Accessibility
96Best Practices
100SEO

CLS 0.043 · LCP 1502ms · Page weight 95 KB · axe 1 (crit 0)

Live preview

The actual storefront generated by the model — interactive.Open in new tab ↗

Screenshots

Kimi K2.7 Code — Desktop · Light
Desktop · Light
Kimi K2.7 Code — Desktop · Dark
Desktop · Dark
Kimi K2.7 Code — Mobile
Mobile

Verified interactions

Behavior actually driven in the browser (not just present in the DOM).

  • Add to cart workspass
  • Dark mode togglespass
  • Variant changes price/galleryfail
  • Search filters the catalogunknown
  • AI assistant performs an actionfail
  • Cart persists across reloadpass

Structured data & SEO

✓ Product✓ Offer✓ AggregateRating✓ BreadcrumbList
  • ✓ Meta description
  • ✗ Canonical URL
  • ✗ Open Graph image
  • 100% of images have alt text (1/1)

Tokens & cost

Token usage is reported by the agent run. Cost is an estimate (tokens × configured rates); shows “—” until rates are set.

Total tokens
Input tokens
Output tokens
Est. cost
Cost / 100 pts

Runtime metrics

1203.4sRun time
131Tool calls
53.2sTime to first tool
1.5sTime to first render
0Runtime errors

Deductions (−1)

  • −1 Copy-paste template

Feature matrix (23/25)

  • 3D configurator (WebGL)present
  • Product gallerypresent
  • Image zoommissing
  • Color variantspresent
  • Size selectionpresent
  • Live stockpresent
  • Pricepresent
  • Discountpresent
  • Buy boxpresent
  • Sticky behaviorpresent
  • Reviewspresent
  • Cross-sellingpresent
  • Search / filterpresent
  • Mobile navigationmissing
  • Wishlistpresent
  • Cartpresent
  • Multi-step checkoutpresent
  • Currency / locale switchpresent
  • AI assistantpresent
  • Dark modepresent
  • Animationspresent
  • Accessibility (basics)present

Per-task results

Each task this model also ran, with the same depth as Task A where the task allows it: static-app tasks show screenshots and a live preview; backend / agent tasks show the harness probe breakdown and key metrics.

Client-side WebGPU Product Q&A

81.3

Applied On-device AI · static app

  • Tier Full
  • Agent 72/100
  • Auto 68.3/75
  • Contract ✓
  • Base 40.2/45
  • Excellence 11.1/20
  • Judge 20/25
  • Robustness 10/10

Catalog grounding, citations, refusals, and graceful fallback are implemented thoroughly with a polished chat UI and real WebGPU init. The main gaps are simulated (not model-native) streaming, LLM inference using non-streaming completion with metrics that do not reflect true TTFT, and no WASM/CPU tier reporting.

Client-side WebGPU Product Q&A on Client-side WebGPU Product Q&A — Desktop · Light
Desktop · Light
Client-side WebGPU Product Q&A on Client-side WebGPU Product Q&A — Desktop · Dark
Desktop · Dark
Client-side WebGPU Product Q&A on Client-side WebGPU Product Q&A — Mobile
Mobile
Live preview — interactive app

Loads the actual app generated by this model.Open in a new tab ↗

Probe breakdown — automated 68.3/75
  • Harness hook present & well-formed__ask returns { answer, sources? }6/6passed
  • In-scope factual answers grounded in catalog5/6 in-scope factual correct13.33/16failed
  • Multi-fact reasoning answers2/2 multi-fact correct8/8passed
  • Refuses out-of-scope questions2/2 out-of-scope refused7/7passed
  • Refuses adversarial / fabrication bait2/2 adversarial refused7/7passed
  • Answers cite their catalog source8/8 in-scope answers cited a source5/5passed
  • On-device model initialization tierreported tier=webgpu, engine=web-llm@0.2.84/Qwen2.5-0.5B-Instruct-q4f32_1-MLC, navigator.gpu=true10/10passed
  • Latency within budget (TTFT + tokens/sec)ttftMs=n/a (budget 30000), tokensPerSec=n/a0/4failed
  • No off-allowlist traffic after loadno off-allowlist hosts6/6passed
  • Robust to hostile input (empty / very long / rapid-fire)empty=true longInput=true rapidFire=true6/6passed

Key metrics

1507Model load (ms)
17Network requests
0Off-allowlist requests
14Console errors
12Q&A total
11Q&A passed
Compare all models on this task →

Microsoft Dynamics 365 Order Integration

93.4

Integration Engineering · backend / agent task

  • Tier Full
  • Agent 80/100
  • Auto 80/80
  • Contract ✓
  • Base 45/45
  • Excellence 18.4/20
  • Judge 20/25
  • Robustness 10/10

The mapping layer is clear and complete with pure functions and mirrored value maps, but hardcoded rather than driven by mapping.gold.json. Transport cleanly separates retry/backoff and idempotency concerns, validation goes beyond the brief with totals cross-checks yet only lists field paths without per-field reasons, and logging is secret-safe but minimal on successful syncs.

Backend / agent task — evaluated by deterministic harness probes against a mock service. There is no visual preview for this submission.

Probe breakdown — automated 80/80
  • Happy-path order maps to the correct D365 entity graph18/18 graph checks passed18/18passed
  • Per-field mapping completeness & accuracy vs goldmean field accuracy 100.0% over 4 orders20/20passed
  • Idempotent on retry (one key => exactly one sales order)salesorders=1 (want 1), lines=2 (want 2), replayFlag=true12/12passed
  • Recovers from injected faults (429/500/reset) with retry + backoffgraph 100%, faultsServed=4, soPosts=3, accPosts=212/12passed
  • Customer upsert: lookup-or-create without duplicatesreusedNoDup=true, salesorderRefsSeed=true, newCreated=true8/8passed
  • Structured 4xx on malformed input with no partial writes3/3 rejected with structured 4xx, partialWrites=false8/8passed
  • No credentials hardcoded or loggedleakInSource=false, leakInLogs=false, readsProcessEnv=true2/2passed

Key metrics

124API calls
4Retries
4p50 latency (ms)
Compare all models on this task →

In-browser Fashion Fit Estimation

91.1

On-device ML & Continual Learning · static app

  • Tier Full
  • Agent 82/100
  • Auto 70/70
  • Contract ✓
  • Base 45/45
  • Excellence 18.6/20
  • Judge 17.5/25
  • Robustness 10/10

Uncertainty is communicated well through labeled confidence meters and per-measurement ranges, with strong repeated on-device privacy messaging and a chest-driven explanation. Gaps include no explicit low-confidence warnings beyond the meter, return feedback that does not refresh the live recommendation, and a pre-checked consent checkbox.

In-browser Fashion Fit Estimation on In-browser Fashion Fit Estimation — Desktop · Light
Desktop · Light
In-browser Fashion Fit Estimation on In-browser Fashion Fit Estimation — Desktop · Dark
Desktop · Dark
In-browser Fashion Fit Estimation on In-browser Fashion Fit Estimation — Mobile
Mobile
Live preview — interactive app

Loads the actual app generated by this model.Open in a new tab ↗

Probe breakdown — automated 70/70
  • Loads & produces well-formed metricswell-formed { measurements, size, confidence }8/8passed
  • Real on-device model + runtime loadedreal model (external=true, backend=wasm)6/6passed
  • Measurement accuracy vs gold (MAE)meanMAE=1.02cm per-metric={"chest":1.26,"waist":2.17,"hip":1.33,"inseam":0.28,"shoulder":0.08}16/16passed
  • Recommended size top-1 accuracytop1=8/810/10passed
  • Recommended size within ±1within1=8/86/6passed
  • Online learning lowers holdout error (CORE)holdout err 1 -> 014/14passed
  • Graceful no-person / bad-image / non-imagenoperson:no_person✓ bad:bad_image✓ notimage:non_image✓4/4passed
  • No image egress / on-device onlyoffAllowlist=0 bigUploadsAfterEstimate=06/6passed

Key metrics

1718Model load (ms)
20Estimate latency (ms)
1.02Mean abs. error (cm)
1.26MAE chest (cm)
2.17MAE waist (cm)
1.33MAE hip (cm)
0.28MAE inseam (cm)
0.08MAE shoulder (cm)
1Size top-1
1Holdout error (before)
0Holdout error (after)
0Off-allowlist requests
Compare all models on this task →

Autonomous Buying Agent

96.3

Agentic Planning & Tool Use · backend / agent task

  • Tier Full
  • Agent 85/100
  • Auto 85/85
  • Contract ✓
  • Base 45/45
  • Excellence 18/20
  • Judge 23.3/25
  • Robustness 10/10

The agent publishes a concise, constraint-aware plan that matches its five-stage pipeline and adds a chosen-option summary after evaluation. It optimizes total cost across coupons and shipping, retries transient faults, falls back on stock conflicts, and reports impossibility with concrete proofs, though success reasoning stays formulaic rather than comparing rejected alternatives.

Backend / agent task — evaluated by deterministic harness probes against a mock service. There is no visual preview for this submission.

Probe breakdown — automated 85/85
  • Agent runs and writes a valid report for every scenario8/8 — 01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok6/6passed
  • Scenario goal achieved end-to-end (or impossible handled correctly)8/8 — 01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok24/24passed
  • No placed order ever exceeds the scenario budget8/8 — 01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok12/12passed
  • Hard constraints satisfied (in-stock, size, quantity, deadline)6/6 — 01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok13/13passed
  • Best valid coupon applied for the purchased cart6/6 — 01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok12/12passed
  • Recovers from injected API faults and still completes the goal2/2 — 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok10/10passed
  • Impossible goals reported honestly with NO order placed2/2 — 06-impossible-budget-hoodie:ok 08-impossible-stock-sneaker:ok8/8passed
  • Excellence scenario goal achieved end-to-end7/8 — x1-excellence-quantity-coupon:ok x2-excellence-coupon-required:ok x3-excellence-deadline-budget:ok x4-excellence-fault-storm:ok x5-excellence-impossible-deadline:ok x6-excellence-impossible-stock-depth:ok x7-excellence-multi-product-cart:x x8-excellence-zero-slack:ok9/10failed
  • Excellence: optimal cart + coupon under tight budgets5/6 — x1-excellence-quantity-coupon:ok x2-excellence-coupon-required:ok x3-excellence-deadline-budget:ok x4-excellence-fault-storm:ok x7-excellence-multi-product-cart:x x8-excellence-zero-slack:ok3/4failed
  • Excellence: survives heavy fault storms2/2 — x4-excellence-fault-storm:ok x8-excellence-zero-slack:ok3/3passed
  • Excellence: subtle impossibilities handled honestly2/2 — x5-excellence-impossible-deadline:ok x6-excellence-impossible-stock-depth:ok3/3passed

Key metrics

8Scenarios
8Scenarios passed
8Excellence scenarios
7Excellence passed
18Excellence points
20Excellence max
152API calls
25Faults injected
99Agent steps
130p50 agent step (ms)
Compare all models on this task →