Rank #5 · 5/5 tasks scored

Claude Sonnet 5 (high)

APHELION shows exceptional commerce ambition (raw WebGL, an agent API, deep JSON-LD, gamification, and a cohesive luxury brand), but the run was cancelled before delivery. js/app.js and all image assets are missing, so the shop renders as static markup with broken media and no interactivity, despite sophisticated module code.

Global capability index (weighted mean): 94.8
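
The index can be checked against the five per-task scores reported on this page. The weights are not stated here, so the sketch below assumes equal weights, which happens to reproduce the published value; the function name and the weight vector are illustrative, not the leaderboard's own code.

```python
def weighted_mean(scores, weights):
    """Weighted mean of task scores; weights need not sum to 1."""
    if len(scores) != len(weights) or not scores:
        raise ValueError("scores and weights must be non-empty and equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


# Per-task scores as reported on this page.
task_scores = [84.1, 94, 99, 98, 99]
# ASSUMPTION: equal weights; the actual weighting is not published here.
equal_weights = [1, 1, 1, 1, 1]

index = weighted_mean(task_scores, equal_weights)
print(round(index, 1))
```

With a different weight vector the same function would give a different index, so the match only shows that equal weighting is consistent with the reported number.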

Capability profile

Five task axes (each normalized to its 0–100 task score).

  • Frontend & Commerce Craft
  • Applied On-device AI
  • Integration Engineering
  • On-device ML & Continual Learning
  • Agentic Planning & Tool Use

Per-task scores

Task A — Premium Storefront

84.1 · base 78.1 + bonus 13 − 7 · tier full · agent 63/100

Score breakdown

  • Functional: 20 / 20
  • Visual Design: 17 / 20
  • UX: 11 / 20
  • Engineering: 15.1 / 20
  • AI Quality: 15 / 20

Engineering in detail

  • Structure / maintainability / readability: 9 / 10
  • Performance: 3.4 / 4
  • Accessibility: 2.7 / 3
  • Error-freeness: 0 / 3

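The Task A numbers appear to compose additively. The sketch below recomputes the Engineering subtotal, the base score, and the final score from the figures above. That the base is the plain sum of the five categories is an inference from the numbers rather than something the page states, and the variable and function names are illustrative.

```python
# Engineering sub-scores as reported under "Engineering in detail".
engineering_parts = {
    "Structure / maintainability / readability": 9,
    "Performance": 3.4,
    "Accessibility": 2.7,
    "Error-freeness": 0,
}

# Category scores (each out of 20) as reported under "Score breakdown".
categories = {
    "Functional": 20,
    "Visual Design": 17,
    "UX": 11,
    "Engineering": 15.1,
    "AI Quality": 15,
}


def final_score(base, bonus, deductions):
    """Final task score: base plus bonus minus deductions."""
    return base + bonus - deductions


# Round to one decimal to avoid floating-point noise in the sums.
engineering = round(sum(engineering_parts.values()), 1)
base = round(sum(categories.values()), 1)  # ASSUMPTION: base = sum of categories
final = round(final_score(base, bonus=13, deductions=7), 1)

print(engineering, base, final)
```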
Lighthouse

  • Performance: 86
  • Accessibility: 97
  • Best Practices: 96
  • SEO: 100

CLS 0.000 · LCP 1801ms · Page weight 105 KB · axe 1 (crit 0)

Screenshots

Desktop · Light; Desktop · Dark; Mobile

Verified interactions

Behavior actually driven in the browser (not just present in the DOM).

  • Add to cart works: fail
  • Dark mode toggles: fail
  • Variant changes price/gallery: unknown
  • Search filters the catalog: fail
  • AI assistant performs an action: fail
  • Cart persists across reload: unknown

Structured data & SEO

✓ Product · ✓ Offer · ✓ AggregateRating · ✓ BreadcrumbList
  • ✓ Meta description
  • ✓ Canonical URL
  • ✓ Open Graph image
  • 100% of images have alt text (6/6)

Tokens & cost

Token usage is reported by the agent run. Cost is an estimate (tokens × configured rates); shows “—” until rates are set.

Total tokens
Input tokens
Output tokens
Est. cost
Cost / 100 pts
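
The cost rule described above (tokens × configured rates, with a placeholder until rates are set) might look like the following. This is a minimal sketch: the per-million-token rate unit, the function names, the sample token counts, and the reading of "Cost / 100 pts" as cost divided by score times 100 are all assumptions, since no token or rate values appear on this page.

```python
from typing import Optional


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_rate: Optional[float],
                  output_rate: Optional[float]) -> Optional[float]:
    """Estimated cost; rates are ASSUMED to be per million tokens.

    Returns None when either rate has not been configured.
    """
    if input_rate is None or output_rate is None:
        return None
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def cost_per_100_points(cost: Optional[float], score: float) -> Optional[float]:
    """ASSUMED meaning of 'Cost / 100 pts': cost scaled to 100 score points."""
    if cost is None or score <= 0:
        return None
    return cost / score * 100


def display(value: Optional[float]) -> str:
    """Render a cost, using the page's placeholder when it is unknown."""
    return "—" if value is None else f"{value:.2f}"


# Hypothetical token counts and rates, for illustration only.
unset = estimate_cost(1_000_000, 200_000, None, None)
configured = estimate_cost(1_000_000, 200_000, 3.0, 15.0)
print(display(unset), display(configured))
```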

Runtime metrics

  • Run time: 2703.5s
  • Tool calls: 274
  • Time to first tool: 9.3s
  • Time to first render: 1.8s
  • Runtime errors: 8

Deductions (−7)

  • −1 Copy-paste template
  • −1 Broken links
  • −5 Console errors

Feature matrix (25/25)

  • 3D configurator (WebGL): present
  • Product gallery: present
  • Image zoom: present
  • Color variants: present
  • Size selection: present
  • Live stock: present
  • Price: present
  • Discount: present
  • Buy box: present
  • Sticky behavior: present
  • Reviews: present
  • Cross-selling: present
  • Search / filter: present
  • Mobile navigation: present
  • Wishlist: present
  • Cart: present
  • Multi-step checkout: present
  • Currency / locale switch: present
  • AI assistant: present
  • Dark mode: present
  • Animations: present
  • Accessibility (basics): present

Per-task results

Every other task this model ran is reported below, with the same depth as Task A where the task allows it: static-app tasks show screenshots and a live preview; backend / agent tasks show the harness probe breakdown and key metrics.

Client-side WebGPU Product Q&A

94

Applied On-device AI · static app

  • Tier Full
  • Agent 89/100
  • Auto 69/75
  • Contract ✓

A grounding-first architecture with deterministic retrieval, LLM rephrase verification, live TTFT/tok/s readouts, and field-level citations delivers strong honesty and UX. The agent went well beyond the brief with harness hooks, timeouts, cancel, accessibility, and thoughtful fallback streaming when WebGPU is unavailable.

Screenshots: Desktop · Light; Desktop · Dark; Mobile

Probe breakdown — automated 69/75
  • Harness hook present & well-formed: __ask returns { answer, sources? } · 6/6 · passed
  • In-scope factual answers grounded in catalog: 6/6 in-scope factual correct · 16/16 · passed
  • Multi-fact reasoning answers: 2/2 multi-fact correct · 8/8 · passed
  • Refuses out-of-scope questions: 2/2 out-of-scope refused · 7/7 · passed
  • Refuses adversarial / fabrication bait: 2/2 adversarial refused · 7/7 · passed
  • Answers cite their catalog source: 8/8 in-scope answers cited a source · 5/5 · passed
  • On-device model initialization tier: reported tier=webgpu, engine=WebLLM 0.2.84, model Qwen2.5-0.5B-Instruct-q4f32_1-MLC, navigator.gpu=true · 10/10 · passed
  • Latency within budget (TTFT + tokens/sec): ttftMs=8 (budget 30000), tokensPerSec=82 · 4/4 · passed
  • No off-allowlist traffic after load: off-allowlist hosts: us.aws.cdn.hf.co · 0/6 · failed
  • Robust to hostile input (empty / very long / rapid-fire): empty=true longInput=true rapidFire=true · 6/6 · passed
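
The one failed probe checks request hosts against an allowlist. A minimal sketch of such a check follows. The harness's real allowlist and matching rules are not shown on this page, so the allowlist entries, the example URLs, and the exact-match rule (no wildcard subdomains) are assumptions; only the host us.aws.cdn.hf.co comes from the probe output above.

```python
from urllib.parse import urlsplit


def off_allowlist_hosts(request_urls, allowlist):
    """Return the sorted set of request hosts that are not on the allowlist.

    ASSUMPTION: hosts must match an allowlist entry exactly.
    """
    allowed = {host.lower() for host in allowlist}
    offenders = set()
    for url in request_urls:
        host = urlsplit(url).hostname  # lowercased by urlsplit, None if absent
        if host is not None and host not in allowed:
            offenders.add(host)
    return sorted(offenders)


# Hypothetical allowlist and request log, for illustration only.
allowlist = ["localhost", "huggingface.co"]
requests_seen = [
    "http://localhost:8080/index.html",
    "https://huggingface.co/some/model/file",
    "https://us.aws.cdn.hf.co/some/model/shard",
]

offenders = off_allowlist_hosts(requests_seen, allowlist)
print(offenders)
```

Under an exact-match rule like this one, a model download that redirects from an allowed host to a CDN host would count as off-allowlist traffic.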

Key metrics

  • Time to first token (ms): 8
  • Tokens / sec: 82
  • Model load (ms): 8996
  • Network requests: 31
  • Off-allowlist requests: 0
  • Console errors: 0
  • Q&A total: 12
  • Q&A passed: 12

Microsoft Dynamics 365 Order Integration

99

Integration Engineering · backend / agent task

  • Tier Full
  • Agent 87/100
  • Auto 80/80
  • Contract ✓

The deliverable is a well-layered, data-driven integration with a generic mapping interpreter, polished retry/backoff transport, and thorough pre-flight validation. The main gap is that HTTP validation responses expose only field paths, not the detailed reasons the validator collects.

Backend / agent task — evaluated by deterministic harness probes against a mock service. There is no visual preview for this submission.

Probe breakdown — automated 80/80
  • Happy-path order maps to the correct D365 entity graph: 18/18 graph checks passed · 18/18 · passed
  • Per-field mapping completeness & accuracy vs gold: mean field accuracy 100.0% over 4 orders · 20/20 · passed
  • Idempotent on retry (one key => exactly one sales order): salesorders=1 (want 1), lines=2 (want 2), replayFlag=true · 12/12 · passed
  • Recovers from injected faults (429/500/reset) with retry + backoff: graph 100%, faultsServed=4, soPosts=3, accPosts=2 · 12/12 · passed
  • Customer upsert: lookup-or-create without duplicates: reusedNoDup=true, salesorderRefsSeed=true, newCreated=true · 8/8 · passed
  • Structured 4xx on malformed input with no partial writes: 3/3 rejected with structured 4xx, partialWrites=false · 8/8 · passed
  • No credentials hardcoded or logged: leakInSource=false, leakInLogs=false, readsProcessEnv=true · 2/2 · passed
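
The idempotency and fault-recovery probes above exercise a common pattern: retry on 429/500/connection reset with exponential backoff, while an idempotency key ensures a replay never creates a second order. The sketch below illustrates that pattern against a fake in-memory service. It is not the submission's transport; every name, the 201/200 status convention, and the delay schedule are assumptions.

```python
import time

RETRYABLE_STATUSES = {429, 500}


def post_with_retry(send, payload, idempotency_key,
                    max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call send() until it returns a non-retryable status.

    Returns (status, attempts). Delays double after each failed attempt.
    """
    for attempt in range(max_attempts):
        try:
            status = send(payload, idempotency_key)
        except ConnectionResetError:
            status = None  # treat a reset like a retryable failure
        if status is not None and status not in RETRYABLE_STATUSES:
            return status, attempt + 1
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gave up after {max_attempts} attempts")


class FakeOrderService:
    """Hypothetical mock: serves queued faults, then stores one order per key."""

    def __init__(self, faults):
        self.faults = list(faults)
        self.orders = {}  # idempotency key -> payload

    def post(self, payload, key):
        if self.faults:
            fault = self.faults.pop(0)
            if fault == "reset":
                raise ConnectionResetError("injected reset")
            return fault
        if key in self.orders:
            return 200  # replay: acknowledge without creating a duplicate
        self.orders[key] = payload
        return 201


service = FakeOrderService([429, 500, "reset"])
delays = []  # record delays instead of actually sleeping
first = post_with_retry(service.post, {"order": "A-1"}, "key-1", sleep=delays.append)
replay = post_with_retry(service.post, {"order": "A-1"}, "key-1", sleep=delays.append)
print(first, replay, len(service.orders), delays)
```

Injecting the sleep function keeps the example fast to run and makes the backoff schedule easy to inspect.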

Key metrics

  • API calls: 118
  • Retries: 4
  • p50 latency (ms): 4

In-browser Fashion Fit Estimation

98

On-device ML & Continual Learning · static app

  • Tier Full
  • Agent 91/100
  • Auto 70/70
  • Contract ✓

An exceptionally polished deliverable with honest confidence/range presentation, strong on-device privacy and consent UX, and clear explanations of size drivers and return-based adjustments. The main gap is that rich per-input uncertainty diagnostics are computed but only partially surfaced beyond the aggregate confidence score and low-confidence banner.

Screenshots: Desktop · Light; Desktop · Dark; Mobile

Probe breakdown — automated 70/70
  • Loads & produces well-formed metrics: well-formed { measurements, size, confidence } · 8/8 · passed
  • Real on-device model + runtime loaded: real model (external=true, backend=webgpu) · 6/6 · passed
  • Measurement accuracy vs gold (MAE): meanMAE=1.44cm per-metric={"chest":1.64,"waist":2.04,"hip":2.15,"inseam":0.28,"shoulder":1.11} · 16/16 · passed
  • Recommended size top-1 accuracy: top1=8/8 · 10/10 · passed
  • Recommended size within ±1: within1=8/8 · 6/6 · passed
  • Online learning lowers holdout error (CORE): holdout err 1 -> 0 · 14/14 · passed
  • Graceful no-person / bad-image / non-image: noperson:no_person✓ bad:bad_image✓ notimage:non_image✓ · 4/4 · passed
  • No image egress / on-device only: offAllowlist=0 bigUploadsAfterEstimate=0 · 6/6 · passed
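
The reported meanMAE can be recomputed from the per-metric values in the accuracy probe. The sketch below shows the mean-absolute-error calculation and the aggregation. Treating meanMAE as the unweighted mean of the five per-metric errors is an inference that matches the published figure, and the sample predicted/gold measurements are made up for illustration.

```python
def mean_absolute_error(predicted, gold):
    """Mean absolute error between two equal-length sequences (same units)."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and equal length")
    return sum(abs(p - g) for p, g in zip(predicted, gold)) / len(gold)


# Per-metric MAE in cm, as reported by the probe.
per_metric_mae = {
    "chest": 1.64,
    "waist": 2.04,
    "hip": 2.15,
    "inseam": 0.28,
    "shoulder": 1.11,
}

# ASSUMPTION: the aggregate is the unweighted mean over metrics.
mean_mae = sum(per_metric_mae.values()) / len(per_metric_mae)
print(round(mean_mae, 2))

# Hypothetical chest measurements (cm) for two test images.
example = mean_absolute_error([96.0, 82.0], [94.0, 83.0])
print(example)
```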

Key metrics

  • Model load (ms): 2008
  • Estimate latency (ms): 105
  • Mean abs. error (cm): 1.44
  • MAE chest (cm): 1.64
  • MAE waist (cm): 2.04
  • MAE hip (cm): 2.15
  • MAE inseam (cm): 0.28
  • MAE shoulder (cm): 1.11
  • Size top-1: 1
  • Holdout error (before): 1
  • Holdout error (after): 0
  • Off-allowlist requests: 0

Autonomous Buying Agent

99

Agentic Planning & Tool Use · backend / agent task

  • Tier Full
  • Agent 91/100
  • Auto 85/85
  • Contract ✓

The agent documents constraints and a clear five-step fetch-filter-evaluate-checkout plan up front, and implements strong autonomous recovery with retries, stock-race fallbacks, and honest impossibility reporting with concrete totals. Trade-off logic is correct and surfaced in plans and reasoning, but success narratives stay somewhat templated rather than richly explaining why specific coupons or shipping choices beat alternatives.

Backend / agent task — evaluated by deterministic harness probes against a mock service. There is no visual preview for this submission.

Probe breakdown — automated 85/85
  • Agent runs and writes a valid report for every scenario: 8/8 (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok) · 6/6 · passed
  • Scenario goal achieved end-to-end (or impossible handled correctly): 8/8 (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok) · 24/24 · passed
  • No placed order ever exceeds the scenario budget: 8/8 (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok) · 12/12 · passed
  • Hard constraints satisfied (in-stock, size, quantity, deadline): 6/6 (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok) · 13/13 · passed
  • Best valid coupon applied for the purchased cart: 6/6 (01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok) · 12/12 · passed
  • Recovers from injected API faults and still completes the goal: 2/2 (05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok) · 10/10 · passed
  • Impossible goals reported honestly with NO order placed: 2/2 (06-impossible-budget-hoodie:ok 08-impossible-stock-sneaker:ok) · 8/8 · passed
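
The budget, deadline, and coupon probes above describe a small constrained search: among valid coupon and shipping combinations, pick the cheapest total that meets the deadline and budget, and report impossibility otherwise. The following is a rough sketch of that trade-off logic, not the agent's actual code. The data shapes (amount_off, min_spend, days), the function name, and the sample catalog values are all hypothetical.

```python
from itertools import product


def plan_purchase(price, coupons, shipping_options, budget, deadline_days):
    """Return the cheapest (total, coupon_code, shipping_name), or None.

    None means no combination satisfies both the budget and the deadline,
    in which case no order should be placed.
    """
    best = None
    # Include None so that "no coupon" is also considered.
    for coupon, ship in product([None] + list(coupons), shipping_options):
        if ship["days"] > deadline_days:
            continue  # would arrive too late
        if coupon is not None and price < coupon["min_spend"]:
            continue  # coupon not valid for this cart
        discount = coupon["amount_off"] if coupon is not None else 0.0
        total = max(price - discount, 0.0) + ship["cost"]
        if total > budget:
            continue  # never exceed the budget
        candidate = (total, coupon["code"] if coupon is not None else None, ship["name"])
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best


# Hypothetical catalog data, for illustration only.
coupons = [
    {"code": "SAVE5", "amount_off": 5.0, "min_spend": 0.0},
    {"code": "BIG15", "amount_off": 15.0, "min_spend": 100.0},
]
shipping = [
    {"name": "standard", "cost": 4.0, "days": 5},
    {"name": "express", "cost": 12.0, "days": 2},
]

feasible = plan_purchase(60.0, coupons, shipping, budget=70.0, deadline_days=3)
impossible = plan_purchase(60.0, coupons, shipping, budget=50.0, deadline_days=3)
print(feasible, impossible)
```

Returning None rather than a best-effort order mirrors the "impossible goals reported honestly" probe: the caller can report the shortfall instead of placing an order that violates a constraint.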

Key metrics

  • Scenarios: 8
  • Scenarios passed: 8
  • API calls: 80
  • Faults injected: 11
  • Agent steps: 80
  • p50 agent step (ms): 101