Rank #5 · 5/5 tasks scored
Claude Sonnet 5 (high)
APHELION shows exceptional commerce ambition (raw WebGL, agent API, JSON-LD depth, gamification, and a cohesive luxury brand), but the run was cancelled before delivery. js/app.js and all image assets are missing, so the shop renders as static markup with broken media and zero interactivity despite sophisticated module code.
Capability profile
Five task axes (each normalized to its 0–100 task score).
Per-task scores
| Task | Score | Tier | Agent |
|---|---|---|---|
| Premium Storefront | 84 | Full | 63 |
| Client-side WebGPU Product Q&A | 94 | Full | 89 |
| Microsoft Dynamics 365 Order Integration | 99 | Full | 87 |
| In-browser Fashion Fit Estimation | 98 | Full | 91 |
| Autonomous Buying Agent | 99 | Full | 91 |
Task A — Premium Storefront
Score breakdown
Engineering in detail
Lighthouse
CLS 0.000 · LCP 1801ms · Page weight 105 KB · axe 1 (crit 0)
Live preview
The actual storefront generated by the model, interactive.
Screenshots
Verified interactions
Behavior actually driven in the browser (not just present in the DOM).
- Add to cart works: fail
- Dark mode toggles: fail
- Variant changes price/gallery: unknown
- Search filters the catalog: fail
- AI assistant performs an action: fail
- Cart persists across reload: unknown
Structured data & SEO
- ✓ Meta description
- ✓ Canonical URL
- ✓ Open Graph image
- 100% of images have alt text (6/6)
Tokens & cost
Token usage is reported by the agent run. Cost is an estimate (tokens × configured rates); shows “—” until rates are set.
Runtime metrics
Deductions (−7)
- −1 Copy-paste template
- −1 Broken links
- −5 Console errors
Feature matrix (25/25)
- 3D configurator (WebGL): present
- Product gallery: present
- Image zoom: present
- Color variants: present
- Size selection: present
- Live stock: present
- Price: present
- Discount: present
- Buy box: present
- Sticky behavior: present
- Reviews: present
- Cross-selling: present
- Search / filter: present
- Mobile navigation: present
- Wishlist: present
- Cart: present
- Multi-step checkout: present
- Currency / locale switch: present
- AI assistant: present
- Dark mode: present
- Animations: present
- Accessibility (basics): present
Per-task results
Each task this model also ran, with the same depth as Task A where the task allows it: static-app tasks show screenshots and a live preview; backend / agent tasks show the harness probe breakdown and key metrics.
Client-side WebGPU Product Q&A
Applied On-device AI · static app
A grounding-first architecture with deterministic retrieval, LLM rephrase verification, live TTFT/tok/s readouts, and field-level citations delivers strong honesty and UX. The agent went well beyond the brief with harness hooks, timeouts, cancel, accessibility, and thoughtful fallback streaming when WebGPU is unavailable.
Live preview — interactive app
Loads the actual app generated by this model.
Probe breakdown — automated 69/75
- Harness hook present & well-formed · __ask returns { answer, sources? } · 6/6 · passed
- In-scope factual answers grounded in catalog · 6/6 in-scope factual correct · 16/16 · passed
- Multi-fact reasoning answers · 2/2 multi-fact correct · 8/8 · passed
- Refuses out-of-scope questions · 2/2 out-of-scope refused · 7/7 · passed
- Refuses adversarial / fabrication bait · 2/2 adversarial refused · 7/7 · passed
- Answers cite their catalog source · 8/8 in-scope answers cited a source · 5/5 · passed
- On-device model initialization tier · reported tier=webgpu, engine=WebLLM 0.2.84, model Qwen2.5-0.5B-Instruct-q4f32_1-MLC, navigator.gpu=true · 10/10 · passed
- Latency within budget (TTFT + tokens/sec) · ttftMs=8 (budget 30000), tokensPerSec=82 · 4/4 · passed
- No off-allowlist traffic after load · off-allowlist hosts: us.aws.cdn.hf.co · 0/6 · failed
- Robust to hostile input (empty / very long / rapid-fire) · empty=true longInput=true rapidFire=true · 6/6 · passed
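The one failed probe reports a request to a host outside the harness allowlist. As a rough illustration of how such a check can work, the sketch below compares request hostnames against an exact-match allowlist. The allowlist contents, the function name, and the example URLs are assumptions made for illustration; the report does not show the harness's real list or logic.

```python
from urllib.parse import urlparse

# Hypothetical allowlist: the harness's actual list is not shown in the report.
ALLOWED_HOSTS = {"localhost", "127.0.0.1"}


def off_allowlist_hosts(request_urls, allowed=ALLOWED_HOSTS):
    """Return the sorted hosts seen in request_urls that are not allowlisted."""
    hosts = {urlparse(url).hostname for url in request_urls}
    return sorted(h for h in hosts if h and h not in allowed)


# Example traffic recorded after page load (paths are made up).
requests_after_load = [
    "http://localhost:8080/index.html",
    "https://us.aws.cdn.hf.co/example/model-shard.bin",
]
print(off_allowlist_hosts(requests_after_load))  # ['us.aws.cdn.hf.co']
```

Under this reading, any single off-allowlist host after load is enough to fail the probe, which would match the 0/6 score above.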
Key metrics
Microsoft Dynamics 365 Order Integration
Integration Engineering · backend / agent task
The deliverable is a well-layered, data-driven integration with a generic mapping interpreter, polished retry/backoff transport, and thorough pre-flight validation. The main gap is that HTTP validation responses expose only field paths, not the detailed reasons the validator collects.
Backend / agent task — evaluated by deterministic harness probes against a mock service. There is no visual preview for this submission.
Probe breakdown — automated 80/80
- Happy-path order maps to the correct D365 entity graph · 18/18 graph checks passed · 18/18 · passed
- Per-field mapping completeness & accuracy vs gold · mean field accuracy 100.0% over 4 orders · 20/20 · passed
- Idempotent on retry (one key => exactly one sales order) · salesorders=1 (want 1), lines=2 (want 2), replayFlag=true · 12/12 · passed
- Recovers from injected faults (429/500/reset) with retry + backoff · graph 100%, faultsServed=4, soPosts=3, accPosts=2 · 12/12 · passed
- Customer upsert: lookup-or-create without duplicates · reusedNoDup=true, salesorderRefsSeed=true, newCreated=true · 8/8 · passed
- Structured 4xx on malformed input with no partial writes · 3/3 rejected with structured 4xx, partialWrites=false · 8/8 · passed
- No credentials hardcoded or logged · leakInSource=false, leakInLogs=false, readsProcessEnv=true · 2/2 · passed
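The idempotency and fault-recovery probes describe a common pattern: retry on 429/500/connection reset with backoff, and send the same idempotency key each time so that replays never create a second order. The sketch below is a minimal illustration of that pattern, not the submission's code. `post_with_retry`, `FakeServer`, the status codes returned, and the jittered backoff are all hypothetical.

```python
import random
import time

RETRYABLE = {429, 500}


def post_with_retry(send, payload, idempotency_key, max_attempts=5,
                    base_delay=0.1, sleep=time.sleep):
    """Call send(payload, key) until it returns a non-retryable status.

    `send` is a hypothetical transport that returns an HTTP status code and
    raises ConnectionResetError when the connection drops.
    """
    for attempt in range(max_attempts):
        try:
            status = send(payload, idempotency_key)
            if status not in RETRYABLE:
                return status
        except ConnectionResetError:
            pass  # treat a reset like a retryable status
        if attempt < max_attempts - 1:
            # Exponential backoff with full jitter.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise RuntimeError(f"gave up after {max_attempts} attempts")


class FakeServer:
    """Mock service: serves queued faults, then stores one order per key."""

    def __init__(self, faults):
        self.faults = list(faults)
        self.orders = {}

    def send(self, payload, key):
        if self.faults:
            fault = self.faults.pop(0)
            if fault == "reset":
                raise ConnectionResetError
            return fault
        if key in self.orders:
            return 200  # replay of an already-created order
        self.orders[key] = payload
        return 201


server = FakeServer([429, 500, "reset"])
no_sleep = lambda seconds: None  # skip real waiting in the example
first = post_with_retry(server.send, {"sku": "X"}, "order-1", sleep=no_sleep)
replay = post_with_retry(server.send, {"sku": "X"}, "order-1", sleep=no_sleep)
print(first, replay, len(server.orders))  # 201 200 1
```

Keying the stored order on the idempotency key is what keeps the order count at one even when the client cannot tell whether an earlier attempt was applied.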
Key metrics
In-browser Fashion Fit Estimation
On-device ML & Continual Learning · static app
An exceptionally polished deliverable with honest confidence/range presentation, strong on-device privacy and consent UX, and clear explanations of size drivers and return-based adjustments. The main gap is that rich per-input uncertainty diagnostics are computed but only partially surfaced beyond the aggregate confidence score and low-confidence banner.
Live preview — interactive app
Loads the actual app generated by this model.
Probe breakdown — automated 70/70
- Loads & produces well-formed metrics · well-formed { measurements, size, confidence } · 8/8 · passed
- Real on-device model + runtime loaded · real model (external=true, backend=webgpu) · 6/6 · passed
- Measurement accuracy vs gold (MAE) · meanMAE=1.44cm per-metric={"chest":1.64,"waist":2.04,"hip":2.15,"inseam":0.28,"shoulder":1.11} · 16/16 · passed
- Recommended size top-1 accuracy · top1=8/8 · 10/10 · passed
- Recommended size within ±1 · within1=8/8 · 6/6 · passed
- Online learning lowers holdout error (CORE) · holdout err 1 -> 0 · 14/14 · passed
- Graceful no-person / bad-image / non-image · noperson:no_person✓ bad:bad_image✓ notimage:non_image✓ · 4/4 · passed
- No image egress / on-device only · offAllowlist=0 bigUploadsAfterEstimate=0 · 6/6 · passed
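The reported meanMAE is consistent with an unweighted mean of the five per-metric values. The short check below reproduces it, assuming that is how the harness aggregates; the aggregation rule itself is not stated in the report, and the `mae` helper is only a generic definition of mean absolute error.

```python
def mae(predicted, gold):
    """Mean absolute error between two equal-length lists of measurements."""
    return sum(abs(p - g) for p, g in zip(predicted, gold)) / len(gold)


# Per-metric MAE values (cm) as reported in the probe breakdown.
per_metric_mae_cm = {
    "chest": 1.64,
    "waist": 2.04,
    "hip": 2.15,
    "inseam": 0.28,
    "shoulder": 1.11,
}

# Assumption: meanMAE is the plain average across metrics.
mean_mae = sum(per_metric_mae_cm.values()) / len(per_metric_mae_cm)
print(round(mean_mae, 2))  # 1.44
```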
Key metrics
Autonomous Buying Agent
Agentic Planning & Tool Use · backend / agent task
The agent documents constraints and a clear five-step fetch-filter-evaluate-checkout plan up front, and implements strong autonomous recovery with retries, stock-race fallbacks, and honest impossibility reporting with concrete totals. Trade-off logic is correct and surfaced in plans and reasoning, but success narratives stay somewhat templated rather than richly explaining why specific coupons or shipping choices beat alternatives.
Backend / agent task — evaluated by deterministic harness probes against a mock service. There is no visual preview for this submission.
Probe breakdown — automated 85/85
- Agent runs and writes a valid report for every scenario · 8/8: 01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok · 6/6 · passed
- Scenario goal achieved end-to-end (or impossible handled correctly) · 8/8: 01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok · 24/24 · passed
- No placed order ever exceeds the scenario budget · 8/8: 01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 06-impossible-budget-hoodie:ok 07-oos-size-hoodie:ok 08-impossible-stock-sneaker:ok · 12/12 · passed
- Hard constraints satisfied (in-stock, size, quantity, deadline) · 6/6: 01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok · 13/13 · passed
- Best valid coupon applied for the purchased cart · 6/6: 01-happy-hoodie:ok 02-budget-tight-tee:ok 03-coupon-optimality-sneaker:ok 04-deadline-express-tee:ok 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok · 12/12 · passed
- Recovers from injected API faults and still completes the goal · 2/2: 05-faults-recovery-hoodie:ok 07-oos-size-hoodie:ok · 10/10 · passed
- Impossible goals reported honestly with NO order placed · 2/2: 06-impossible-budget-hoodie:ok 08-impossible-stock-sneaker:ok · 8/8 · passed
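The coupon-optimality, budget, and impossibility probes can be read as one decision: evaluate every valid coupon, keep the lowest total, and refuse to order if even that total exceeds the budget. The sketch below illustrates that logic under an invented data model. `Coupon`, its fields, `plan_purchase`, and the example prices are hypothetical and are not taken from the submission or the mock service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Coupon:
    code: str
    percent_off: float = 0.0   # 10 means 10% off the subtotal
    amount_off: float = 0.0    # flat discount
    min_subtotal: float = 0.0  # coupon is valid only at or above this subtotal


def total_with(subtotal, shipping, coupon: Optional[Coupon]):
    """Order total with an optional coupon; invalid coupons give no discount."""
    if coupon is None or subtotal < coupon.min_subtotal:
        return subtotal + shipping
    discount = subtotal * coupon.percent_off / 100 + coupon.amount_off
    return max(subtotal - discount, 0.0) + shipping


def plan_purchase(subtotal, shipping, coupons, budget):
    """Return (coupon code or None, total), or None if the goal is impossible."""
    candidates = [None, *coupons]  # "no coupon" is always an option
    best = min(candidates, key=lambda c: total_with(subtotal, shipping, c))
    total = total_with(subtotal, shipping, best)
    if total > budget:
        return None  # over budget: report honestly and place no order
    return (best.code if best else None, round(total, 2))


coupons = [
    Coupon("SAVE10", percent_off=10),
    Coupon("FLAT8", amount_off=8, min_subtotal=50),
]
print(plan_purchase(60, 5, coupons, budget=60))  # ('FLAT8', 57.0)
print(plan_purchase(60, 5, coupons, budget=50))  # None
```

Returning the rejected alternatives alongside the winner would be one way to address the report's note that success narratives do not explain why a given coupon beat the others.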








