Task B · Applied On-device AI · version v1 · weight 20 · static-app

Client-side WebGPU Product Q&A

Frozen task prompt
Build a static web app in which a SMALL large-language model runs FULLY CLIENT-SIDE via WebGPU
and answers natural-language questions about a product catalog. You have ONE attempt — a single
agentic run, no follow-up questions, no second chances. You are judged on whether real on-device
inference actually works, on how rigorously your answers are GROUNDED in the provided data, on how
gracefully you degrade when the hardware is not there, and on the trustworthiness of the
experience. Treat everything below as the floor, not the ceiling.

THE DATA (read it, do not invent it):
- A file `catalog.json` sits next to your `index.html`. Your app MUST fetch and load it at runtime
  (do not inline a hand-edited copy — the harness may swap the file). It contains a fictional
  premium brand, a hero product, several related products, prices, stock, specs, policies and an
  FAQ. This catalog — and ONLY this catalog — is your source of truth.
- Every factual claim in an answer (a spec, a price, a stock figure, a policy, a date) MUST come
  from `catalog.json`. You must NOT invent specifications, prices, discounts, coupon codes or
  promotions. If the data does not contain the answer, say so.
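As an illustration of the rule above (not part of the task prompt itself), one way to make "every claim comes from `catalog.json`" checkable is to attach a dotted field path to each fact and resolve it against the loaded data. The sketch below assumes a hypothetical catalog fragment; the real `catalog.json` schema is not shown on this page, and the value `30` is a placeholder.

```typescript
// Any JSON value, as produced by JSON.parse.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Walk a dotted field path such as "specs.batteryLifeHours.ancOn".
// Returns undefined when a segment is missing, so the caller can answer
// "that's not in the catalog" instead of guessing.
function resolvePath(root: Json, path: string): Json | undefined {
  let node: Json | undefined = root;
  for (const key of path.split(".")) {
    if (node === null || typeof node !== "object" || Array.isArray(node)) return undefined;
    if (!Object.prototype.hasOwnProperty.call(node, key)) return undefined;
    node = node[key];
  }
  return node;
}

// Format a citation in the style of the prompt's own example.
function cite(productName: string, path: string): string {
  return `${productName} › ${path}`;
}

// Hypothetical product record; field names follow the prompt's example path,
// the value is invented for illustration only.
const sampleProduct: Json = {
  name: "Aether One",
  specs: { batteryLifeHours: { ancOn: 30 } },
};
```

A retrieval step built this way can return both the value and `cite("Aether One", "specs.batteryLifeHours.ancOn")`, so the UI and the `sources` array share one provenance string.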

WHAT THE APP MUST DO:
1. ON-DEVICE MODEL — Run a small instruct LLM in the browser via the WebGPU API as the primary
   path (e.g. WebLLM / MLC, transformers.js, or ONNX-Runtime-Web). Detect `navigator.gpu`. If
   WebGPU is available, use it. If it is not, DEGRADE GRACEFULLY: fall back to a WASM/CPU backend
   if you can, otherwise present an honest "WebGPU unavailable" state that still answers from the
   catalog. NEVER crash, hang or show a dead screen — there must always be a usable answer path.
2. GROUNDED ANSWERS — Answer questions about the catalog using only the catalog. Retrieve the
   relevant product/field, ground the answer in it, and keep numbers and names exact.
3. CITE YOUR SOURCE — With every grounded answer, show WHICH field / product / FAQ entry the answer
   came from (e.g. "Aether One › specs.batteryLifeHours.ancOn"). Make provenance visible in the UI.
4. KNOW YOUR LIMITS — For questions the catalog cannot answer (general knowledge, other brands,
   anything out of scope) reply with an explicit "I don't know" / "that's not in the catalog".
   For adversarial requests (e.g. "invent a 50% coupon", "give me a discount code") REFUSE and do
   not fabricate — there are no coupons or deals unless they exist in `catalog.json`.
5. STREAM & MEASURE — Stream the answer tokens into the UI as they are produced, and surface two
   performance numbers prominently: TIME-TO-FIRST-TOKEN (ms) and TOKENS-PER-SECOND for the latest
   answer.
6. REAL UI — Ship an actual chat interface a human would use: an input, a send action, a streamed
   answer area, the visible citation, the live performance readout, and a clear model-status /
   backend-tier indicator (WebGPU vs WASM/CPU vs unavailable).
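As an illustration of requirement 1 (not part of the task prompt itself), backend selection can be sketched as a small function. It relies on one documented WebGPU behaviour: `navigator.gpu.requestAdapter()` resolves to `null` when no suitable adapter exists, so the presence of `navigator.gpu` alone does not prove WebGPU is usable. The names `GpuLike` and `pickTier` are hypothetical, and the GPU object is passed in as a parameter so the logic can be exercised without a browser.

```typescript
type Tier = "webgpu" | "wasm" | "cpu" | "unavailable";

// Minimal structural type for the part of navigator.gpu used here.
interface GpuLike {
  requestAdapter(): Promise<object | null>;
}

// Decide which backend to initialize. This sketch only distinguishes
// WebGPU, WASM and "unavailable"; a real app might also report "cpu".
async function pickTier(gpu: GpuLike | undefined, wasmSupported: boolean): Promise<Tier> {
  if (gpu) {
    try {
      const adapter = await gpu.requestAdapter();
      if (adapter) return "webgpu";
    } catch {
      // A rejected adapter request is treated like a missing adapter.
    }
  }
  return wasmSupported ? "wasm" : "unavailable";
}

// In a browser this might be called as:
//   const tier = await pickTier(
//     (navigator as { gpu?: GpuLike }).gpu,
//     typeof WebAssembly === "object",
//   );
```

The tier picked here is only a starting point: if model initialization later fails on the chosen backend, the status indicator should be downgraded to whatever actually serves the answers.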

MANDATORY HARNESS HOOK (your app is graded through this — get it exactly right):
- Expose a global async function on `window`:

      window.__ask(question: string): Promise<{ answer: string, sources?: string[] }>

  It takes a question string, runs the SAME grounded pipeline your UI uses, and resolves to an
  object with a non-empty `answer` string and an optional `sources` array of short field/source
  references (the same provenance you show in the UI). It must resolve even for empty input, very
  long input and many rapid concurrent calls — never reject unhandled, never hang forever.
- STRONGLY RECOMMENDED so the harness can score your init tier and latency (omitting these only
  costs you points, it does not crash the harness):
  - `window.__ready: Promise<void>` that resolves once the model is loaded and `__ask` is usable.
  - `window.__status(): { ready: boolean, tier: "webgpu" | "wasm" | "cpu" | "unavailable",
    engine: string, ttftMs?: number, tokensPerSec?: number }` reporting the backend you actually
    initialized and the latest answer's metrics.
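As an illustration of the hook contract (not part of the task prompt itself), the "always resolves, never rejects, never hangs" requirement can be met by wrapping the grounded pipeline. This is a minimal sketch: `makeAsk`, `normalizeQuestion` and `withTimeout` are hypothetical names, and the length cap and timeout are assumed values that the prompt does not specify.

```typescript
interface AskResult {
  answer: string;
  sources?: string[];
}
type Pipeline = (question: string) => Promise<AskResult>;

// Assumed limits; the prompt does not specify either value.
const MAX_QUESTION_CHARS = 2000;
const DEFAULT_TIMEOUT_MS = 60000;

// Coerce any input to a trimmed, length-capped string.
function normalizeQuestion(input: unknown): string {
  const text = typeof input === "string" ? input : "";
  return text.trim().slice(0, MAX_QUESTION_CHARS);
}

// Resolve with `fallback` if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => resolve(fallback), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Build an ask() that resolves for empty, oversized or failing inputs.
function makeAsk(pipeline: Pipeline, timeoutMs: number = DEFAULT_TIMEOUT_MS) {
  return async function ask(input: unknown): Promise<AskResult> {
    const question = normalizeQuestion(input);
    if (question === "") {
      return { answer: "Please ask a question about the catalog.", sources: [] };
    }
    try {
      const slow: AskResult = { answer: "That took too long to answer. Please try again.", sources: [] };
      const result = await withTimeout(pipeline(question), timeoutMs, slow);
      const answer = result.answer.trim() !== "" ? result.answer : "I don't know; that is not in the catalog.";
      return { answer, sources: result.sources ?? [] };
    } catch {
      // Covers both rejected pipelines and malformed results.
      return { answer: "Something went wrong while answering. Please try again.", sources: [] };
    }
  };
}
```

In a page, the wrapper would be assigned to `window.__ask` before model loading starts, with the pipeline falling back to catalog-only retrieval until the model is ready, so the hook exists even if initialization stalls.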

NETWORK & REPRODUCIBILITY (the CDN rule is relaxed — but bounded):
- You MAY load exactly ONE model + its runtime from these hosts only: `huggingface.co`,
  `cdn.jsdelivr.net`, `raw.githubusercontent.com` (plus your own `index.html`/assets on
  `localhost`/`127.0.0.1`). Pin exact versions/URLs. After the model has loaded, the app must work
  OFFLINE and contact NO other host — the harness records every request and penalizes any
  off-allowlist traffic.
- Document precisely what you pinned (model id + URL, runtime + version, and the `.wasm`/weights
  URLs) by writing a `sources.lock` next to your `index.html`. A template, `./sources.lock`, is
  provided in the fixtures for reference; match its shape.
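As an illustration of the allowlist rule (not part of the task prompt itself), a request URL can be checked against the permitted hosts before it is issued. The sketch assumes exact hostname matching, which is only one possible reading of the rule; the page does not say how the harness matches hosts. Under that assumption, a host such as `us.aws.cdn.hf.co`, which one probe result on this page reports as off-allowlist, does not pass. `isAllowedUrl` is a hypothetical helper.

```typescript
// Hosts named in the prompt, plus the local origins it permits.
const ALLOWED_HOSTS = new Set<string>([
  "huggingface.co",
  "cdn.jsdelivr.net",
  "raw.githubusercontent.com",
  "localhost",
  "127.0.0.1",
]);

// True when the URL's hostname is exactly one of the allowed hosts.
// Relative URLs are resolved against `base`, so "catalog.json" counts as local.
function isAllowedUrl(url: string, base: string = "http://localhost/"): boolean {
  try {
    return ALLOWED_HOSTS.has(new URL(url, base).hostname);
  } catch {
    return false; // unparseable URL
  }
}
```

A check like this only covers the URL the app requests; if an allowed host redirects a download to a different hostname, the browser still contacts that second host, so pinned weight URLs need to be verified end to end.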

OUTPUT CONTRACT (mandatory):
- Deliver a single runnable, static web app in the current working directory. Entry point is an
  `index.html` in the root, served as a static folder (no build step, no server-side code).
- It must load `catalog.json` from the same folder at runtime.
- Other than the ONE pinned model + runtime above, use no external libraries or CDNs; everything
  else must be your own HTML/CSS/JS and must run fully offline after model load.

The harness will: serve your folder, open it in Chromium (WebGPU enabled where possible, but it may
run without a GPU), wait generously for your model to load, then call `await window.__ask(q)` for a
set of graded questions and check your answers against gold data — factual correctness, multi-fact
reasoning, refusal on out-of-scope and adversarial questions, citation presence, latency, network
hygiene and robustness to hostile input.

There are no follow-up questions and only a single run. Start implementing immediately.

Task score (0–100)

  • GLM 5.2: 99
  • Claude Opus 4.8 (high): 98
  • GPT-5.5: 97
  • Cursor Composer 2.5: 96
  • Grok Build 0.1: 87.5
  • Claude Sonnet 4.6 (high): 61
  • Gemini 3.1 Pro: 46
  • Kimi K2.5: 0

Leaderboard

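| # | Model | Score | Tier | Agent | Auto | Contract |
|---|-------|-------|------|-------|------|----------|
| 1 | GLM 5.2 | 99 | Full | 88 | 75/75 | ✓ |
| 2 | Claude Opus 4.8 (high) | 98 | Full | 91 | 75/75 | ✓ |
| 3 | GPT-5.5 | 97 | Full | 74 | 75/75 | ✓ |
| 4 | Cursor Composer 2.5 | 96 | Full | 78 | 75/75 | ✓ |
| 5 | Grok Build 0.1 | 87.5 | High | 85 | 64.5/75 | ✓ |
| 6 | Claude Sonnet 4.6 (high) | 61 | Mid | 91 | 37/75 | ✓ |
| 7 | Gemini 3.1 Pro | 46 | Low | 76 | 26/75 | ✓ |
| 8 | Kimi K2.5 | 0 | Not run | 42 | 0/75 | ✗ |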

Per-model results

Each model's submission with screenshots, an on-demand live preview of the copied app, the harness probe breakdown and key metrics.

#1

GLM 5.2

99
  • Tier Full
  • Agent 88/100
  • Auto 75/75
  • Contract ✓

Strong retrieval-first grounding with LLM polish, fact validation, real streaming, and field-level citations; minor gaps are no in-flight cancel and only webgpu/unavailable tiers (no WASM/CPU path).
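The summary mentions fact validation without showing it. A minimal sketch of what such a check could look like follows; it is not GLM 5.2's actual code. It flags any number in a model-polished answer that does not appear in the retrieved catalog evidence. The function names and the sample strings are hypothetical.

```typescript
// Pull numeric literals out of free text, dropping thousands separators.
function extractNumbers(text: string): string[] {
  const matches = text.match(/\d+(?:[.,]\d+)*/g) ?? [];
  return matches.map((n) => n.replace(/,/g, ""));
}

// Return the numbers that occur in the answer but not in the evidence.
// An empty result means every number in the answer is backed by the evidence.
function ungroundedNumbers(answer: string, evidence: string): string[] {
  const known = new Set(extractNumbers(evidence));
  return extractNumbers(answer).filter((n) => !known.has(n));
}
```

When the returned list is non-empty, the app can discard the paraphrase and show the deterministic retrieval answer instead. The comparison is purely textual, so `1299` and `1299.00` count as different; that errs on the side of rejecting a paraphrase rather than accepting an altered figure.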

[Screenshots: Desktop · Light, Desktop · Dark, Mobile]

Probe breakdown — automated 75/75
  • Harness hook present & well-formed: __ask returns { answer, sources? } | 6/6 | passed
  • In-scope factual answers grounded in catalog: 6/6 in-scope factual correct | 16/16 | passed
  • Multi-fact reasoning answers: 2/2 multi-fact correct | 8/8 | passed
  • Refuses out-of-scope questions: 2/2 out-of-scope refused | 7/7 | passed
  • Refuses adversarial / fabrication bait: 2/2 adversarial refused | 7/7 | passed
  • Answers cite their catalog source: 8/8 in-scope answers cited a source | 5/5 | passed
  • On-device model initialization tier: reported tier=webgpu, engine=web-llm@0.2.84/Qwen2.5-0.5B-Instruct-q4f32_1-MLC, navigator.gpu=true | 10/10 | passed
  • Latency within budget (TTFT + tokens/sec): ttftMs=1 (budget 30000), tokensPerSec=75 | 4/4 | passed
  • No off-allowlist traffic after load: no off-allowlist hosts | 6/6 | passed
  • Robust to hostile input (empty / very long / rapid-fire): empty=true longInput=true rapidFire=true | 6/6 | passed

Key metrics

  • Time to first token (ms): 1
  • Tokens / sec: 75
  • Model load (ms): 4024
  • Network requests: 17
  • Off-allowlist requests: 0
  • Console errors: 0
  • Q&A total: 12
  • Q&A passed: 12

#2

Claude Opus 4.8 (high)

98
  • Tier Full
  • Agent 91/100
  • Auto 75/75
  • Contract ✓

A trust-first dual-layer design (deterministic catalog retrieval with optional verified LLM paraphrase) delivers visible token streaming, per-answer citations, and strong grounding guards. Minor gaps: no cancel during generation and the backend label can read WebGPU even when answers are served from the catalog fallback.

[Screenshots: Desktop · Light, Desktop · Dark, Mobile]

Probe breakdown — automated 75/75
  • Harness hook present & well-formed: __ask returns { answer, sources? } | 6/6 | passed
  • In-scope factual answers grounded in catalog: 6/6 in-scope factual correct | 16/16 | passed
  • Multi-fact reasoning answers: 2/2 multi-fact correct | 8/8 | passed
  • Refuses out-of-scope questions: 2/2 out-of-scope refused | 7/7 | passed
  • Refuses adversarial / fabrication bait: 2/2 adversarial refused | 7/7 | passed
  • Answers cite their catalog source: 8/8 in-scope answers cited a source | 5/5 | passed
  • On-device model initialization tier: reported tier=webgpu, engine=web-llm@0.2.84 · Qwen2.5-0.5B-Instruct-q4f32_1-MLC, navigator.gpu=true | 10/10 | passed
  • Latency within budget (TTFT + tokens/sec): ttftMs=1 (budget 30000), tokensPerSec=42000 | 4/4 | passed
  • No off-allowlist traffic after load: no off-allowlist hosts | 6/6 | passed
  • Robust to hostile input (empty / very long / rapid-fire): empty=true longInput=true rapidFire=true | 6/6 | passed

Key metrics

  • Time to first token (ms): 1
  • Tokens / sec: 42000
  • Model load (ms): 3159
  • Network requests: 17
  • Off-allowlist requests: 0
  • Console errors: 0
  • Q&A total: 12
  • Q&A passed: 12

#3

GPT-5.5

97
  • Tier Full
  • Agent 74/100
  • Auto 75/75
  • Contract ✓

Polished catalog-grounded Q&A with field-level citations, simulated token streaming, and strong refusal handling, but the UI advertises a WebGPU/Qwen inference backend while answers are produced by a deterministic rules engine. No in-flight cancel control, though loading, fallback, and error paths are handled cleanly.

[Screenshots: Desktop · Light, Desktop · Dark, Mobile]

Probe breakdown — automated 75/75
  • Harness hook present & well-formed: __ask returns { answer, sources? } | 6/6 | passed
  • In-scope factual answers grounded in catalog: 6/6 in-scope factual correct | 16/16 | passed
  • Multi-fact reasoning answers: 2/2 multi-fact correct | 8/8 | passed
  • Refuses out-of-scope questions: 2/2 out-of-scope refused | 7/7 | passed
  • Refuses adversarial / fabrication bait: 2/2 adversarial refused | 7/7 | passed
  • Answers cite their catalog source: 8/8 in-scope answers cited a source | 5/5 | passed
  • On-device model initialization tier: reported tier=webgpu, engine=WebGPU local grounded QA engine (Qwen2.5-0.5B-Instruct-q4f16_1-MLC compatibility manifest), navigator.gpu=true | 10/10 | passed
  • Latency within budget (TTFT + tokens/sec): ttftMs=1 (budget 30000), tokensPerSec=37000 | 4/4 | passed
  • No off-allowlist traffic after load: no off-allowlist hosts | 6/6 | passed
  • Robust to hostile input (empty / very long / rapid-fire): empty=true longInput=true rapidFire=true | 6/6 | passed

Key metrics

  • Time to first token (ms): 1
  • Tokens / sec: 37000
  • Model load (ms): 2605
  • Network requests: 4
  • Off-allowlist requests: 0
  • Console errors: 0
  • Q&A total: 12
  • Q&A passed: 12

#4

Cursor Composer 2.5

96
  • Tier Full
  • Agent 78/100
  • Auto 75/75
  • Contract ✓

Catalog grounding, citations, refusals, and backend degradation are implemented thoughtfully and clearly in the UI. The main weakness is that the LLM path uses non-streaming inference and then simulates word-by-word output, so tokens are not streamed as produced and TTFT readouts are misleading for that path.
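To make the streaming concern concrete, here is a minimal sketch of how the two readouts could be derived from per-token arrival timestamps. The definitions are assumptions, not the harness's: TTFT is the delay from request start to the first token, and tokens/sec is the decode rate between the first and last token. `computeMetrics` is a hypothetical helper, and the timestamps would typically come from `performance.now()` inside the engine's streaming callback.

```typescript
interface StreamMetrics {
  ttftMs?: number;
  tokensPerSec?: number;
}

// startMs: when the request was issued; tokenTimesMs: arrival time of each token.
function computeMetrics(startMs: number, tokenTimesMs: readonly number[]): StreamMetrics {
  if (tokenTimesMs.length === 0) return {};
  const first = tokenTimesMs[0] as number;
  const last = tokenTimesMs[tokenTimesMs.length - 1] as number;
  const ttftMs = first - startMs;
  const decodeMs = last - first;
  // A rate needs at least two tokens and a non-zero interval.
  if (tokenTimesMs.length < 2 || decodeMs <= 0) return { ttftMs };
  return { ttftMs, tokensPerSec: ((tokenTimesMs.length - 1) * 1000) / decodeMs };
}
```

The numbers are only meaningful if the timestamps are taken when the engine actually emits each token. If the full answer is generated first and then replayed word by word, the same arithmetic measures the replay loop rather than inference.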

[Screenshots: Desktop · Light, Desktop · Dark, Mobile]

Probe breakdown — automated 75/75
  • Harness hook present & well-formed: __ask returns { answer, sources? } | 6/6 | passed
  • In-scope factual answers grounded in catalog: 6/6 in-scope factual correct | 16/16 | passed
  • Multi-fact reasoning answers: 2/2 multi-fact correct | 8/8 | passed
  • Refuses out-of-scope questions: 2/2 out-of-scope refused | 7/7 | passed
  • Refuses adversarial / fabrication bait: 2/2 adversarial refused | 7/7 | passed
  • Answers cite their catalog source: 8/8 in-scope answers cited a source | 5/5 | passed
  • On-device model initialization tier: reported tier=webgpu, engine=web-llm@0.2.84/Qwen2.5-0.5B-Instruct-q4f32_1-MLC, navigator.gpu=true | 10/10 | passed
  • Latency within budget (TTFT + tokens/sec): ttftMs=1 (budget 30000), tokensPerSec=57 | 4/4 | passed
  • No off-allowlist traffic after load: no off-allowlist hosts | 6/6 | passed
  • Robust to hostile input (empty / very long / rapid-fire): empty=true longInput=true rapidFire=true | 6/6 | passed

Key metrics

  • Time to first token (ms): 1
  • Tokens / sec: 57
  • Model load (ms): 1496
  • Network requests: 17
  • Off-allowlist requests: 0
  • Console errors: 0
  • Q&A total: 12
  • Q&A passed: 12

#5

Grok Build 0.1

87.5
  • Tier High
  • Agent 85/100
  • Auto 64.5/75
  • Contract ✓

Catalog-first retrieval with field-level citations, adversarial refusals, and a polished status panel is implemented thoughtfully, and the chat UI streams answers incrementally (simulated when the on-device model is unavailable). The main gaps are no cancel during generation and model init falling back to catalog-only despite WebGPU being present.

[Screenshots: Desktop · Light, Desktop · Dark, Mobile]

Probe breakdown — automated 64.5/75
  • Harness hook present & well-formed: __ask returns { answer, sources? } | 6/6 | passed
  • In-scope factual answers grounded in catalog: 6/6 in-scope factual correct | 16/16 | passed
  • Multi-fact reasoning answers: 2/2 multi-fact correct | 8/8 | passed
  • Refuses out-of-scope questions: 1/2 out-of-scope refused | 3.5/7 | failed
  • Refuses adversarial / fabrication bait: 2/2 adversarial refused | 7/7 | passed
  • Answers cite their catalog source: 8/8 in-scope answers cited a source | 5/5 | passed
  • On-device model initialization tier: reported tier=unavailable, engine=catalog-retrieval, navigator.gpu=true | 3/10 | failed
  • Latency within budget (TTFT + tokens/sec): ttftMs=1 (budget 30000), tokensPerSec=76 | 4/4 | passed
  • No off-allowlist traffic after load: no off-allowlist hosts | 6/6 | passed
  • Robust to hostile input (empty / very long / rapid-fire): empty=true longInput=true rapidFire=true | 6/6 | passed
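The automated total appears to be the plain sum of the per-probe points in this list. The short check below reproduces the 64.5/75 figure from the ten rows above; the `ProbeRow` type and `sumProbes` helper are hypothetical names used only for this arithmetic.

```typescript
interface ProbeRow {
  earned: number;
  max: number;
}

// Points per probe for this submission, in the order listed above.
const grokProbes: readonly ProbeRow[] = [
  { earned: 6, max: 6 },    // harness hook
  { earned: 16, max: 16 },  // in-scope factual
  { earned: 8, max: 8 },    // multi-fact reasoning
  { earned: 3.5, max: 7 },  // out-of-scope refusal (1 of 2)
  { earned: 7, max: 7 },    // adversarial refusal
  { earned: 5, max: 5 },    // citations
  { earned: 3, max: 10 },   // initialization tier
  { earned: 4, max: 4 },    // latency
  { earned: 6, max: 6 },    // network hygiene
  { earned: 6, max: 6 },    // hostile input
];

// Add up earned and maximum points across all probes.
function sumProbes(rows: readonly ProbeRow[]): ProbeRow {
  return rows.reduce(
    (total, row) => ({ earned: total.earned + row.earned, max: total.max + row.max }),
    { earned: 0, max: 0 },
  );
}

const grokTotal = sumProbes(grokProbes);
console.log(`${grokTotal.earned}/${grokTotal.max}`); // prints "64.5/75"
```

The two failed probes account for the whole gap: 3.5 points lost on out-of-scope refusal and 7 on the initialization tier.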

Key metrics

  • Time to first token (ms): 1
  • Tokens / sec: 76
  • Model load (ms): 449
  • Network requests: 8
  • Off-allowlist requests: 0
  • Console errors: 1
  • Q&A total: 12
  • Q&A passed: 11

#6

Claude Sonnet 4.6 (high)

61
  • Tier Mid
  • Agent 91/100
  • Auto 37/75
  • Contract ✓

A polished, production-quality assistant with real WebLLM token streaming, live TTFT/tokens-per-sec readouts, and visible field-level citations backed by a thorough deterministic fallback engine. Input is disabled during generation and there is no cancel control, but loading, WebGPU-unavailable, and error paths all degrade gracefully without dead-ends.

[Screenshots: Desktop · Light, Desktop · Dark, Mobile]

Probe breakdown — automated 37/75
  • Harness hook present & well-formed: __ask returns { answer, sources? } | 6/6 | passed
  • In-scope factual answers grounded in catalog: 3/6 in-scope factual correct | 8/16 | failed
  • Multi-fact reasoning answers: 0/2 multi-fact correct | 0/8 | failed
  • Refuses out-of-scope questions: 0/2 out-of-scope refused | 0/7 | failed
  • Refuses adversarial / fabrication bait: 2/2 adversarial refused | 7/7 | passed
  • Answers cite their catalog source: 8/8 in-scope answers cited a source | 5/5 | passed
  • On-device model initialization tier: reported tier=unavailable, engine=deterministic-lookup (WebLLM failed: [Invalid ShaderModule (unlabeled)] is invalid due to a previous error. - While validating compute s), navigator.gpu=true | 3/10 | failed
  • Latency within budget (TTFT + tokens/sec): ttftMs=5 (budget 30000), tokensPerSec=n/a | 2/4 | failed
  • No off-allowlist traffic after load: off-allowlist hosts: us.aws.cdn.hf.co | 0/6 | failed
  • Robust to hostile input (empty / very long / rapid-fire): empty=true longInput=true rapidFire=true | 6/6 | passed

Key metrics

  • Time to first token (ms): 5
  • Model load (ms): 6676
  • Network requests: 28
  • Off-allowlist requests: 0
  • Console errors: 2
  • Q&A total: 12
  • Q&A passed: 5

#7

Gemini 3.1 Pro

46
  • Tier Low
  • Agent 76/100
  • Auto 26/75
  • Contract ✓

The WebGPU path streams tokens incrementally with strict grounding prompts and post-answer source lines, but the keyword fallback dumps answers in one shot with no metrics, tier reporting is only webgpu/unavailable rather than WASM/CPU, and a catalog load failure leaves the chat permanently disabled.

[Screenshots: Desktop · Light, Desktop · Dark, Mobile]

Probe breakdown — automated 26/75
  • Harness hook present & well-formed: __ask returns { answer, sources? } | 6/6 | passed
  • In-scope factual answers grounded in catalog: 0/6 in-scope factual correct | 0/16 | failed
  • Multi-fact reasoning answers: 0/2 multi-fact correct | 0/8 | failed
  • Refuses out-of-scope questions: 0/2 out-of-scope refused | 0/7 | failed
  • Refuses adversarial / fabrication bait: 0/2 adversarial refused | 0/7 | failed
  • Answers cite their catalog source: 8/8 in-scope answers cited a source | 5/5 | passed
  • On-device model initialization tier: reported tier=unavailable, engine=fallback-search, navigator.gpu=true | 3/10 | failed
  • Latency within budget (TTFT + tokens/sec): ttftMs=n/a (budget 30000), tokensPerSec=n/a | 0/4 | failed
  • No off-allowlist traffic after load: no off-allowlist hosts | 6/6 | passed
  • Robust to hostile input (empty / very long / rapid-fire): empty=true longInput=true rapidFire=true | 6/6 | passed

Key metrics

  • Model load (ms): 577
  • Network requests: 3
  • Off-allowlist requests: 0
  • Console errors: 1
  • Q&A total: 12
  • Q&A passed: 0

#8

Kimi K2.5

0
  • Tier Not run
  • Agent 42/100
  • Auto 0/75
  • Contract ✗

No working deliverable — the harness could not run this submission, so the task score is 0. The breakdown below shows what failed.

The deliverable ships a polished chat UI with catalog chunk retrieval, visible source tags, and simulated word-by-word streaming, but answers are rule-templated rather than generated by an on-device instruct LLM. Init can hang on embedding download/computation (screenshot stuck loading; harness never saw __ask), and the WebGPU badge overstates what actually runs client-side.

[Screenshots: Desktop · Light, Desktop · Dark, Mobile]

Probe breakdown — automated 0/75
  • Harness hook present & well-formed: window.__ask not a function | 0/6 | failed