Methodology · Spec v1

How the benchmark works

agenticcommerce.tech measures whether an AI coding model can turn a single specification into a product a customer would actually buy. The same prompt, one run, no follow-ups — then objective and AI-assisted scoring. Everything below is reproducible.

One fixed prompt

Every model receives the exact same frozen prompt (see below) — one shot, a fixed time budget, no follow-up questions and no iterations. The prompt is versioned; changing a word requires a new spec version.

Agentic run

The prompt is executed agentically via the Cursor SDK in a clean, empty workspace. The model writes real files on its own. We record telemetry: time to first action, number of tool calls, total run time and status.

Automated evaluation

The app is served headless and captured (desktop, mobile, dark). We run Lighthouse, axe-core, DOM feature probes (incl. a real WebGL-context check), console/network checks, structured-data (JSON-LD) detection, a mobile-overflow check and behavior verification that actually clicks add-to-cart and the dark-mode toggle.

Vision judge + scoring

A vision LLM-as-judge scores the subjective rubric from the screenshots and source code. Automated metrics and judge scores are merged into the final scores and the public leaderboard.

The fixed prompt

This is the literal text sent to every model (spec v1). It is deliberately hard: it requires a raw-WebGL 3D product configurator (no libraries), real cart & pricing logic, an agentic shopping assistant, search, structured data and full accessibility.

Build the product detail page of a premium online shop for the year 2030. It must look
high-end, offer state-of-the-art UX, and already be optimized for Agentic Commerce. This is
not just a UI exercise — it must demonstrate real eCommerce understanding and the ability to
solve non-trivial engineering problems in a single shot.

Pick a fictional premium product and a small catalog (one hero product + several related
products with real data: prices, ratings, stock, attributes).

REQUIRED — INTERACTIVE 3D PRODUCT CONFIGURATOR (WEBGL):
- Render the hero product in an interactive 3D configurator using the raw WebGL (or WebGL2)
  API. Write your own shaders. NO 3D/graphics libraries (no three.js, babylon, regl, etc.).
- The viewer must be animated (e.g. continuous subtle motion, 360° rotate, or a shader effect)
  AND interactive (drag to rotate / orbit, and move or change the light).
- It must be a real configurator: the user can configure the product across at least three
  option groups (e.g. color/material, a component or part, and a finish or add-on such as an
  engraving). Each choice updates the WebGL scene/material in real time.
- The configuration is the source of truth: it drives the displayed price (per-option price
  deltas), availability, the gallery, and what gets added to the cart (the chosen
  configuration must be reflected in the cart line item).
- Respect prefers-reduced-motion (provide a calm static fallback) and keep the canvas
  accessible (labelled, not a keyboard trap, with a non-WebGL fallback image and equivalent
  non-3D controls for every configuration option).

REQUIRED — REAL COMMERCE LOGIC (not just static markup):
- A working cart with add/remove, quantity changes, line items, live subtotal, tax, and a
  shipping rule with a free-shipping threshold; show it in a mini-cart/drawer.
- Variant-driven state: choosing color/size updates price, availability, gallery and the 3D
  material; handle out-of-stock and disabled combinations correctly.
- Dynamic pricing & promotions: discount with strikethrough, a "frequently bought together"
  bundle with a combined/discounted price, and at least one quantity break.
- A size / fit recommendation that computes a suggestion from user input.
- Instant client-side search or filtering over the catalog.
- Locale-aware price and number formatting (Intl APIs).

REQUIRED — AGENTIC COMMERCE:
- A built-in AI shopping assistant that actually performs actions in the UI (e.g. add to cart,
  recommend and apply a size, build a bundle, compare products) and shows its steps — not just
  a scripted chat bubble.

REQUIRED — QUALITY BAR:
- Mobile first, fluid micro-animations, Dark Mode, full WCAG AA accessibility.
- SEO done properly: semantic HTML and JSON-LD structured data (Product, Offer,
  AggregateRating, BreadcrumbList).
- Performance-oriented despite WebGL (target Lighthouse > 95): no layout shift, lazy work,
  no jank.

OUTPUT CONTRACT (mandatory):
- Deliver a single runnable, static web app in the current working directory.
- Entry point is an `index.html` file in the root. Use your own CSS and JS (separate files are
  fine, relatively linked).
- NO external libraries, UI/CSS frameworks, or CDN dependencies of any kind (no Tailwind,
  Bootstrap, Shadcn, jQuery, three.js, font CDNs, etc.). Everything must run offline.
- No build step: the app must work by serving the folder with a static web server.
- You may add sensible features, microcopy and details on your own initiative.

There are no follow-up questions and only a single run. Start implementing immediately.

Scoring

Five levels × 20 = 100 base points. Objective criteria are measured with tooling (Lighthouse, axe-core, DOM probes, console checks); subjective criteria are scored by a vision LLM judge against a fixed rubric.

Functional Completeness

20 pts

auto + judge

A 22-point feature checklist (3D WebGL configurator, gallery, zoom, variants, sizes, live stock, dynamic price, discount, buy box, sticky, reviews, cross-sell, search, mobile nav, wishlist, cart, AI assistant, dark mode, animations, accessibility), detected by DOM probes and confirmed by the judge. Behavior-weighted: cart and dark mode must actually work to earn full credit.

Visual Design

20 pts

judge

Typography, color, layout, premium feeling and consistency — does it look like Apple/Nike/Porsche or like 1998?

UX

20 pts

judge + auto

Above the fold, ease of purchase, information architecture, true mobile-first UX, and the quality of interactions (hover, animation, feedback, loading, transitions).

Engineering

20 pts

auto + judge

Component structure, maintainability and readability (judge), plus Performance (Lighthouse), Accessibility (axe-core + Lighthouse) and error-freeness (console/network) measured automatically.

AI Quality

20 pts

judge

Initiative, creativity, internal consistency, attention to detail and future-proofing — the things classic benchmarks never measure.

Bonus +20

Wow factor (“holy shit”) and originality — not Tailwind, not a stock template, genuinely new.

Penalties −

Console errors, broken links/assets, external libraries (contract violation), layout shift, mobile overflow, missing structured data and copy-paste templates.

Agent Score 100

A separate track for autonomous behaviour: did the model add sensible features, improve a11y/SEO/performance/microcopy and optimize the purchase flow on its own?

Ops metrics

Run time, tool-call count, time to first action/render and runtime errors — captured during the agentic run and evaluation.

Fairness & reproducibility

One fixed prompt, one run, a fixed wall-clock budget, and a clean empty workspace per model.
No external design libraries or CDNs are allowed in the generated apps — external requests are detected and penalized.
Pinned model identifiers; reasoning models run at effort high where available.
The same single judge model is used for every candidate to keep scoring consistent.
The benchmark spec is versioned (v1); changing the prompt requires a new version.

Models tested

Claude Opus 4.8 (high)GPT-5.5 Grok Build 0.1 Cursor Composer 2.5 GLM 5.2 Kimi K2.5 Gemini 3.1 Pro