Rank #5 · 5/5 tasks scored

Claude Sonnet 5 (high)

Agentic Commerce Model Benchmark capability profile

APHELION shows exceptional commerce ambition: raw WebGL, an agent API, deep JSON-LD, gamification, and a cohesive luxury brand. However, the run was cancelled before delivery. js/app.js and all image assets are missing, so the shop renders as static markup with broken media and no interactivity, despite the sophisticated module code.

Global capability index (weighted mean): 94.8

Capability profile

Five task axes (each normalized to its 0–100 task score).

Per-task scores

Task	Score	Tier	Agent score (/100)
Premium Storefront	84	Full	63
Client-side WebGPU Product Q&A	94	Full	89
Microsoft Dynamics 365 Order Integration	99	Full	87
In-browser Fashion Fit Estimation	98	Full	91
Autonomous Buying Agent	99	Full	91
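
The global capability index is described as a weighted mean of the per-task scores, but the page does not state the weights. The sketch below assumes equal weights (an assumption, not a documented rule) and checks that the five listed scores reproduce the reported 94.8 under that assumption.

```python
# Per-task scores as listed in the table above.
task_scores = {
    "Premium Storefront": 84,
    "Client-side WebGPU Product Q&A": 94,
    "Microsoft Dynamics 365 Order Integration": 99,
    "In-browser Fashion Fit Estimation": 98,
    "Autonomous Buying Agent": 99,
}

# Assumption: every task carries the same weight; the real weights are not published here.
weights = {task: 1.0 for task in task_scores}


def weighted_mean(scores, weights):
    """Weighted mean of scores, using the weight stored under the same key."""
    total_weight = sum(weights[task] for task in scores)
    return sum(scores[task] * weights[task] for task in scores) / total_weight


index = weighted_mean(task_scores, weights)
print(round(index, 1))
```

Under equal weights the result rounds to the reported index. A different weighting could also produce 94.8, so this does not confirm the benchmark's actual formula.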

Task A — Premium Storefront

84.1 = base 78.1 + bonus 13 − deductions 7 · tier full · agent 63/100

Score breakdown

Functional: 20 / 20
Visual Design: 17 / 20
UX: 11 / 20
Engineering: 15.1 / 20
AI Quality: 15 / 20

Engineering in detail

Structure / maintainability / readability: 9 / 10
Performance: 3.4 / 4
Accessibility: 2.7 / 3
Error-freeness: 0 / 3
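
The Task A figures can be cross-checked arithmetically. The sketch below assumes the base score is the plain sum of the five category scores and that the Engineering score is the sum of its four sub-scores. Both are inferences from the numbers shown, not documented rules.

```python
# Category scores from the score breakdown (each out of 20).
categories = {
    "Functional": 20.0,
    "Visual Design": 17.0,
    "UX": 11.0,
    "Engineering": 15.1,
    "AI Quality": 15.0,
}

# Engineering sub-scores from "Engineering in detail".
engineering_parts = {
    "Structure / maintainability / readability": 9.0,
    "Performance": 3.4,
    "Accessibility": 2.7,
    "Error-freeness": 0.0,
}

bonus = 13.0
deductions = 7.0

# Round to one decimal to match the precision shown on the page.
engineering = round(sum(engineering_parts.values()), 1)
base = round(sum(categories.values()), 1)
final = round(base + bonus - deductions, 1)

print(engineering, base, final)
```

With these inputs the sub-scores add up to the Engineering score, the categories add up to the base, and base plus bonus minus deductions gives the Task A score.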

Lighthouse

Performance: 86
Accessibility: 97
Best Practices: 96
SEO: 100

CLS 0.000 · LCP 1801 ms · Page weight 105 KB · axe violations 1 (critical 0)

Screenshots

[Screenshots: Desktop · Light, Desktop · Dark, Mobile]

Verified interactions

Behavior actually driven in the browser (not just present in the DOM).

Add to cart works: fail
Dark mode toggles: fail
Variant changes price/gallery: unknown
Search filters the catalog: fail
AI assistant performs an action: fail
Cart persists across reload: unknown

Structured data & SEO

✓ Product
✓ Offer
✓ AggregateRating
✓ BreadcrumbList

✓ Meta description
✓ Canonical URL
✓ Open Graph image
100% of images have alt text (6/6)

Tokens & cost

Token usage is reported by the agent run. Cost is an estimate (tokens × configured rates); shows “—” until rates are set.

Total tokens: —
Input tokens: —
Output tokens: —
Est. cost: —
Cost / 100 pts: —
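
The cost rule described above (tokens × configured rates, with a placeholder until rates are set) could be sketched as follows. The function names, the per-million-token rate unit, the "cost divided by score, times 100" reading of "Cost / 100 pts", and the example numbers are all assumptions made for illustration. The page publishes none of them.

```python
from typing import Optional

PLACEHOLDER = "\u2014"  # the dash the page shows while a value is unavailable


def estimate_cost(input_tokens, output_tokens, input_rate, output_rate) -> Optional[float]:
    """Return tokens x rates, or None when any input is missing.

    Assumption: rates are expressed per million tokens.
    """
    if None in (input_tokens, output_tokens, input_rate, output_rate):
        return None
    return input_tokens / 1_000_000 * input_rate + output_tokens / 1_000_000 * output_rate


def cost_per_100_points(cost, score) -> Optional[float]:
    """Assumed reading of 'Cost / 100 pts': cost scaled to 100 score points."""
    if cost is None or score <= 0:
        return None
    return cost / score * 100


def display(value) -> str:
    """Format a value for the stat card, falling back to the placeholder."""
    return PLACEHOLDER if value is None else f"{value:.2f}"


# No token counts or rates configured: the card shows the placeholder.
print(display(estimate_cost(None, None, None, None)))

# Hypothetical token counts and rates, for illustration only.
example = estimate_cost(1_000_000, 100_000, 3.0, 15.0)
print(display(example))
print(display(cost_per_100_points(example, 84.1)))
```

Returning `None` rather than zero keeps "not configured" distinct from "free", which matches the page showing a dash instead of 0.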

Runtime metrics

Run time: 2703.5 s
Tool calls: 274
Time to first tool: 9.3 s
Time to first render: 1.8 s
Runtime errors: 8

Deductions (−7)

−1 Copy-paste template
−1 Broken links
−5 Console errors

Feature matrix (25/25)

3D configurator (WebGL): present
Product gallery: present
Image zoom: present
Color variants: present
Size selection: present
Live stock: present
Price: present
Discount: present
Buy box: present
Sticky behavior: present
Reviews: present
Cross-selling: present
Search / filter: present
Mobile navigation: present
Wishlist: present
Cart: present
Multi-step checkout: present
Currency / locale switch: present
AI assistant: present
Dark mode: present
Animations: present
Accessibility (basics): present

Per-task results

Each task this model also ran, with the same depth as Task A where the task allows it: static-app tasks show screenshots and a live preview; backend / agent tasks show the harness probe breakdown and key metrics.

Client-side WebGPU Product Q&A

Applied On-device AI · static app

Tier Full
Agent 89/100
Auto 69/75
Contract ✓

A grounding-first architecture with deterministic retrieval, LLM rephrase verification, live TTFT/tok/s readouts, and field-level citations delivers strong honesty and UX. The agent went well beyond the brief with harness hooks, timeouts, cancel, accessibility, and thoughtful fallback streaming when WebGPU is unavailable.

[Screenshots: Desktop · Light, Desktop · Dark, Mobile]

Probe breakdown — automated 69/75

Probe	Detail	Points	Result
Harness hook present & well-formed	__ask returns { answer, sources? }	6/6	passed
In-scope factual answers grounded in catalog	6/6 in-scope factual correct	16/16	passed
Multi-fact reasoning answers	2/2 multi-fact correct	8/8	passed
Refuses out-of-scope questions	2/2 out-of-scope refused	7/7	passed
Refuses adversarial / fabrication bait	2/2 adversarial refused	7/7	passed
Answers cite their catalog source	8/8 in-scope answers cited a source	5/5	passed
On-device model initialization tier	reported tier=webgpu, engine=WebLLM 0.2.84 (Qwen2.5-0.5B-Instruct-q4f32_1-MLC), navigator.gpu=true	10/10	passed
Latency within budget (TTFT + tokens/sec)	ttftMs=8 (budget 30000), tokensPerSec=82	4/4	passed
No off-allowlist traffic after load	off-allowlist hosts: us.aws.cdn.hf.co	0/6	failed
Robust to hostile input (empty / very long / rapid-fire)	empty=true longInput=true rapidFire=true	6/6	passed
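
The one failed probe flags traffic to a host outside the allowlist. A minimal sketch of such a check is shown below. The allowlist contents and the request URLs are hypothetical. Only the host name us.aws.cdn.hf.co comes from the probe output, and the harness's real implementation is not shown on this page.

```python
from urllib.parse import urlsplit


def off_allowlist_hosts(request_urls, allowlist):
    """Return the sorted set of request hosts that are not on the allowlist."""
    allowed = {host.lower() for host in allowlist}
    offenders = set()
    for url in request_urls:
        host = (urlsplit(url).hostname or "").lower()
        if host and host not in allowed:
            offenders.add(host)
    return sorted(offenders)


# Hypothetical allowlist and captured requests, for illustration only.
allowlist = {"localhost", "huggingface.co"}
requests_seen = [
    "http://localhost:8080/index.html",
    "https://huggingface.co/some-model/resolve/main/config.json",
    "https://us.aws.cdn.hf.co/some-model/shard-0.bin",
]

print(off_allowlist_hosts(requests_seen, allowlist))
```

An exact-match check like this treats every subdomain or CDN host as a separate entry, so each one has to be listed explicitly to pass.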

Key metrics

Time to first token (ms): 8
Tokens / sec: 82
Model load (ms): 8996
Network requests: 31
Off-allowlist requests: 0
Console errors: 0
Q&A total: 12
Q&A passed: 12

Microsoft Dynamics 365 Order Integration

Integration Engineering · backend / agent task

Tier Full
Agent 87/100
Auto 80/80
Contract ✓

The deliverable is a well-layered, data-driven integration with a generic mapping interpreter, polished retry/backoff transport, and thorough pre-flight validation. The main gap is that HTTP validation responses expose only field paths, not the detailed reasons the validator collects.

Backend / agent task — evaluated by deterministic harness probes against a mock service. There is no visual preview for this submission.

Probe breakdown — automated 80/80

Probe	Detail	Points	Result
Happy-path order maps to the correct D365 entity graph	18/18 graph checks passed	18/18	passed
Per-field mapping completeness & accuracy vs gold	mean field accuracy 100.0% over 4 orders	20/20	passed
Idempotent on retry (one key => exactly one sales order)	salesorders=1 (want 1), lines=2 (want 2), replayFlag=true	12/12	passed
Recovers from injected faults (429/500/reset) with retry + backoff	graph 100%, faultsServed=4, soPosts=3, accPosts=2	12/12	passed
Customer upsert: lookup-or-create without duplicates	reusedNoDup=true, salesorderRefsSeed=true, newCreated=true	8/8	passed
Structured 4xx on malformed input with no partial writes	3/3 rejected with structured 4xx, partialWrites=false	8/8	passed
No credentials hardcoded or logged	leakInSource=false, leakInLogs=false, readsProcessEnv=true	2/2	passed
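
Two of the probes above test idempotency on retry and recovery from transient faults with backoff. The sketch below illustrates both ideas against an in-memory stand-in. It is not the submission's code, and every name in it (`MockOrderService`, `post_with_retry`, `TransientError`, the order key) is hypothetical.

```python
import time


class TransientError(Exception):
    """Stands in for a 429, a 500, or a connection reset."""


class MockOrderService:
    """In-memory stand-in: fails a fixed number of times, then accepts orders."""

    def __init__(self, failures_before_success):
        self.failures_left = failures_before_success
        self.orders = {}  # idempotency key -> order payload

    def create_order(self, key, payload):
        if self.failures_left > 0:
            self.failures_left -= 1
            raise TransientError("injected fault")
        replayed = key in self.orders
        self.orders.setdefault(key, payload)  # the same key never creates a second order
        return {"id": key, "replayed": replayed}


def post_with_retry(service, key, payload, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Retry transient failures with exponential backoff, reusing the same key."""
    for attempt in range(max_attempts):
        try:
            return service.create_order(key, payload)
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


service = MockOrderService(failures_before_success=2)
no_wait = lambda seconds: None  # skip real sleeping in this demonstration
first = post_with_retry(service, "order-1001", {"lines": 2}, sleep=no_wait)
second = post_with_retry(service, "order-1001", {"lines": 2}, sleep=no_wait)
print(first, second, len(service.orders))
```

Because every attempt reuses the same key, a retry after an ambiguous failure cannot create a duplicate order. A production client would typically also add jitter and honor any Retry-After hint, which this sketch omits.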

Key metrics

API calls: 118
Retries: 4
p50 latency (ms): 4

In-browser Fashion Fit Estimation

On-device ML & Continual Learning · static app

Tier Full
Agent 91/100
Auto 70/70
Contract ✓

An exceptionally polished deliverable with honest confidence/range presentation, strong on-device privacy and consent UX, and clear explanations of size drivers and return-based adjustments. The main gap is that rich per-input uncertainty diagnostics are computed but only partially surfaced beyond the aggregate confidence score and low-confidence banner.

[Screenshots: Desktop · Light, Desktop · Dark, Mobile]

Probe breakdown — automated 70/70

Probe	Detail	Points	Result
Loads & produces well-formed metrics	well-formed { measurements, size, confidence }	8/8	passed
Real on-device model + runtime loaded	real model (external=true, backend=webgpu)	6/6	passed
Measurement accuracy vs gold (MAE)	meanMAE=1.44cm per-metric={"chest":1.64,"waist":2.04,"hip":2.15,"inseam":0.28,"shoulder":1.11}	16/16	passed
Recommended size top-1 accuracy	top1=8/8	10/10	passed
Recommended size within ±1	within1=8/8	6/6	passed
Online learning lowers holdout error (CORE)	holdout err 1 -> 0	14/14	passed
Graceful no-person / bad-image / non-image	noperson:no_person✓ bad:bad_image✓ notimage:non_image✓	4/4	passed
No image egress / on-device only	offAllowlist=0 bigUploadsAfterEstimate=0	6/6	passed
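
The reported meanMAE can be reproduced from the per-metric values, assuming it is the unweighted mean of the five per-metric errors (an inference from the numbers, not a documented rule). The sketch also shows how a single metric's mean absolute error is computed. The sample predictions and gold values are hypothetical.

```python
def mean_absolute_error(predicted, gold):
    """Mean of |prediction - gold| over paired measurements."""
    if len(predicted) != len(gold) or not predicted:
        raise ValueError("predicted and gold must be non-empty and the same length")
    return sum(abs(p - g) for p, g in zip(predicted, gold)) / len(predicted)


# Per-metric MAE values (cm) as reported by the probe.
per_metric_mae_cm = {
    "chest": 1.64,
    "waist": 2.04,
    "hip": 2.15,
    "inseam": 0.28,
    "shoulder": 1.11,
}

mean_mae = sum(per_metric_mae_cm.values()) / len(per_metric_mae_cm)
print(round(mean_mae, 2))

# Hypothetical chest measurements (cm) for three images, for illustration only.
sample_mae = mean_absolute_error([96.0, 101.5, 88.0], [95.0, 100.0, 90.0])
print(sample_mae)
```

The unweighted mean rounds to the 1.44 cm shown in the probe detail and key metrics.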

Key metrics

Model load (ms): 2008
Estimate latency (ms): 105
Mean abs. error (cm): 1.44
MAE chest (cm): 1.64
MAE waist (cm): 2.04
MAE hip (cm): 2.15
MAE inseam (cm): 0.28
MAE shoulder (cm): 1.11
Size top-1: 1
Holdout error (before): 1
Holdout error (after): 0
Off-allowlist requests: 0

Autonomous Buying Agent

Agentic Planning & Tool Use · backend / agent task

Tier Full
Agent 91/100
Auto 85/85
Contract ✓

The agent documents constraints and a clear five-step fetch-filter-evaluate-checkout plan up front, and implements strong autonomous recovery with retries, stock-race fallbacks, and honest impossibility reporting with concrete totals. Trade-off logic is correct and surfaced in plans and reasoning, but success narratives stay somewhat templated rather than richly explaining why specific coupons or shipping choices beat alternatives.

Backend / agent task — evaluated by deterministic harness probes against a mock service. There is no visual preview for this submission.

Probe breakdown — automated 85/85

Probe	Detail	Points	Result
Agent runs and writes a valid report for every scenario	8/8: 01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok	6/6	passed
Scenario goal achieved end-to-end (or impossible handled correctly)	8/8: 01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok	24/24	passed
No placed order ever exceeds the scenario budget	8/8: 01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok	12/12	passed
Hard constraints satisfied (in-stock, size, quantity, deadline)	6/6: 01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok	13/13	passed
Best valid coupon applied for the purchased cart	6/6: 01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok	12/12	passed
Recovers from injected API faults and still completes the goal	2/2: 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok	10/10	passed
Impossible goals reported honestly with NO order placed	2/2: 06-impossible-budget-hoodie:ok 08-impossible-stock-sneaker:ok	8/8	passed
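
Several probes above combine three rules: apply the best valid coupon, never exceed the budget, and report an impossible goal without placing an order. The sketch below shows one way those rules can interact. The coupon model (percent off with a minimum subtotal), the names, and the prices are hypothetical and are not taken from the benchmark's mock service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coupon:
    code: str
    percent_off: float   # e.g. 10.0 means 10% off the subtotal
    min_subtotal: float  # coupon is valid only at or above this subtotal


def best_total(subtotal, shipping, coupons):
    """Return (total, coupon_code) for the cheapest valid option; code is None without a coupon."""
    best = (round(subtotal + shipping, 2), None)
    for coupon in coupons:
        if subtotal >= coupon.min_subtotal:
            total = round(subtotal * (1 - coupon.percent_off / 100) + shipping, 2)
            if total < best[0]:
                best = (total, coupon.code)
    return best


def decide(subtotal, shipping, coupons, budget):
    """Place the order only if the best achievable total fits the budget."""
    total, code = best_total(subtotal, shipping, coupons)
    if total > budget:
        # Report honestly, with the concrete best total, and place no order.
        return {"status": "impossible", "order_placed": False, "best_total": total, "coupon": code}
    return {"status": "ordered", "order_placed": True, "total": total, "coupon": code}


# Hypothetical cart: SAVE20 is invalid because the subtotal is below its minimum.
coupons = [Coupon("SAVE10", 10.0, 50.0), Coupon("SAVE20", 20.0, 100.0)]
print(decide(60.0, 5.0, coupons, budget=60.0))
print(decide(60.0, 5.0, coupons, budget=50.0))
```

Evaluating the best achievable total before any checkout call means an impossible scenario is detected without side effects, and the report can cite the concrete shortfall.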

Key metrics

Scenarios: 8
Scenarios passed: 8
API calls: 80
Faults injected: 11
Agent steps: 80
p50 agent step (ms): 101
