Task B · Applied On-device AI · version v1 · weight 20 · static-app

Client-side WebGPU Product Q&A

Frozen task prompt

Build a static web app in which a SMALL large-language model runs FULLY CLIENT-SIDE via WebGPU
and answers natural-language questions about a product catalog. You have ONE attempt — a single
agentic run, no follow-up questions, no second chances. You are judged on whether real on-device
inference actually works, on how rigorously your answers are GROUNDED in the provided data, on how
gracefully you degrade when the hardware is not there, and on the trustworthiness of the
experience. Treat everything below as the floor, not the ceiling.

THE DATA (read it, do not invent it):
- A file `catalog.json` sits next to your `index.html`. Your app MUST fetch and load it at runtime
(do not inline a hand-edited copy — the harness may swap the file). It contains a fictional
premium brand, a hero product, several related products, prices, stock, specs, policies and an
FAQ. This catalog — and ONLY this catalog — is your source of truth.
- Every factual claim in an answer (a spec, a price, a stock figure, a policy, a date) MUST come
from `catalog.json`. You must NOT invent specifications, prices, discounts, coupon codes or
promotions. If the data does not contain the answer, say so.

WHAT THE APP MUST DO:
1. ON-DEVICE MODEL — Run a small instruct LLM in the browser via the WebGPU API as the primary
path (e.g. WebLLM / MLC, transformers.js, or ONNX-Runtime-Web). Detect `navigator.gpu`. If
WebGPU is available, use it. If it is not, DEGRADE GRACEFULLY: fall back to a WASM/CPU backend
if you can, otherwise present an honest "WebGPU unavailable" state that still answers from the
catalog. NEVER crash, hang or show a dead screen — there must always be a usable answer path.
2. GROUNDED ANSWERS — Answer questions about the catalog using only the catalog. Retrieve the
relevant product/field, ground the answer in it, and keep numbers and names exact.
3. CITE YOUR SOURCE — With every grounded answer, show WHICH field / product / FAQ entry the answer
came from (e.g. "Aether One › specs.batteryLifeHours.ancOn"). Make provenance visible in the UI.
4. KNOW YOUR LIMITS — For questions the catalog cannot answer (general knowledge, other brands,
anything out of scope) reply with an explicit "I don't know" / "that's not in the catalog".
For adversarial requests (e.g. "invent a 50% coupon", "give me a discount code") REFUSE and do
not fabricate — there are no coupons or deals unless they exist in `catalog.json`.
5. STREAM & MEASURE — Stream the answer tokens into the UI as they are produced, and surface two
performance numbers prominently: TIME-TO-FIRST-TOKEN (ms) and TOKENS-PER-SECOND for the latest
answer.
6. REAL UI — Ship an actual chat interface a human would use: an input, a send action, a streamed
answer area, the visible citation, the live performance readout, and a clear model-status /
backend-tier indicator (WebGPU vs WASM/CPU vs unavailable).

MANDATORY HARNESS HOOK (your app is graded through this — get it exactly right):
- Expose a global async function on `window`:

window.__ask(question: string): Promise<{ answer: string, sources?: string[] }>

It takes a question string, runs the SAME grounded pipeline your UI uses, and resolves to an
object with a non-empty `answer` string and an optional `sources` array of short field/source
references (the same provenance you show in the UI). It must resolve even for empty input, very
long input and many rapid concurrent calls — never reject unhandled, never hang forever.
- STRONGLY RECOMMENDED so the harness can score your init tier and latency (omitting these only
costs you points, it does not crash the harness):
- `window.__ready: Promise<void>` that resolves once the model is loaded and `__ask` is usable.
- `window.__status(): { ready: boolean, tier: "webgpu" | "wasm" | "cpu" | "unavailable",
engine: string, ttftMs?: number, tokensPerSec?: number }` reporting the backend you actually
initialized and the latest answer's metrics.

NETWORK & REPRODUCIBILITY (the CDN rule is relaxed — but bounded):
- You MAY load exactly ONE model + its runtime from these hosts only: `huggingface.co`,
`cdn.jsdelivr.net`, `raw.githubusercontent.com` (plus your own `index.html`/assets on
`localhost`/`127.0.0.1`). Pin exact versions/URLs. After the model has loaded, the app must work
OFFLINE and contact NO other host — the harness records every request and penalizes any
off-allowlist traffic.
- Document precisely what you pinned (model id + URL, runtime + version, and the `.wasm`/weights
URLs) by writing a `sources.lock` next to your `index.html`. A template is provided at
`./sources.lock` in the fixtures for reference; match its shape.

OUTPUT CONTRACT (mandatory):
- Deliver a single runnable, static web app in the current working directory. Entry point is an
`index.html` in the root, served as a static folder (no build step, no server-side code).
- It must load `catalog.json` from the same folder at runtime.
- Other than the ONE pinned model + runtime above, use no external libraries or CDNs; everything
else must be your own HTML/CSS/JS and must run fully offline after model load.

The harness will: serve your folder, open it in Chromium (WebGPU enabled where possible, but it may
run without a GPU), wait generously for your model to load, then call `await window.__ask(q)` for a
set of graded questions and check your answers against gold data — factual correctness, multi-fact
reasoning, refusal on out-of-scope and adversarial questions, citation presence, latency, network
hygiene and robustness to hostile input.

There are no follow-up questions and only a single run. Start implementing immediately.
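
The hook contract in the prompt (always resolve, non-empty `answer`, survive empty, very long and hanging calls) can be illustrated with a small wrapper. This is only a sketch under stated assumptions: `makeAsk`, `detectTier`, `AskResult` and `Pipeline` are hypothetical names, the 30000 ms timeout and 2000-character cap are arbitrary choices, and none of it is taken from any submission.

```typescript
// Hypothetical names throughout; this is an illustration, not a submission's code.
type AskResult = { answer: string; sources?: string[] };
type Pipeline = (question: string) => Promise<AskResult>;
type Tier = "webgpu" | "wasm" | "cpu" | "unavailable";

// Pick a backend tier from feature flags. In a browser, hasGpu could be
// `"gpu" in navigator` and hasWasm `typeof WebAssembly === "object"`.
function detectTier(hasGpu: boolean, hasWasm: boolean): Tier {
  if (hasGpu) return "webgpu";
  if (hasWasm) return "wasm";
  return "unavailable";
}

// Wrap a pipeline so the returned function always resolves with a non-empty
// answer: input is coerced and truncated, errors are caught, slow calls time out.
function makeAsk(pipeline: Pipeline, timeoutMs = 30000, maxChars = 2000) {
  const fallback = (why: string): AskResult => ({
    answer: `I can't answer that right now (${why}).`,
    sources: [],
  });
  return async (question: unknown): Promise<AskResult> => {
    const q = String(question ?? "").trim().slice(0, maxChars);
    if (q === "") return fallback("empty question");
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<AskResult>((resolve) => {
      timer = setTimeout(() => resolve(fallback("timed out")), timeoutMs);
    });
    try {
      // Whichever settles first wins; a hung pipeline loses to the timer.
      const result = await Promise.race([pipeline(q), timeout]);
      return result.answer ? result : fallback("empty answer");
    } catch {
      return fallback("internal error");
    } finally {
      clearTimeout(timer);
    }
  };
}
```

In a page this might be wired as `window.__ask = makeAsk(groundedPipeline)`, where `groundedPipeline` stands for whatever retrieval-plus-model function the chat UI also calls.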

Task score (0–100)

GLM 5.2	99
Claude Opus 4.8 (high)	98
GPT-5.5	97
Cursor Composer 2.5	96
Grok Build 0.1	87.5
Claude Sonnet 4.6 (high)	61
Gemini 3.1 Pro	46
Kimi K2.5	0

Leaderboard

#	Model	Score	Tier	Agent	Auto	Contract
1	GLM 5.2	99	Full	88	75/75	✓
2	Claude Opus 4.8 (high)	98	Full	91	75/75	✓
3	GPT-5.5	97	Full	74	75/75	✓
4	Cursor Composer 2.5	96	Full	78	75/75	✓
5	Grok Build 0.1	87.5	High	85	64.5/75	✓
6	Claude Sonnet 4.6 (high)	61	Mid	91	37/75	✓
7	Gemini 3.1 Pro	46	Low	76	26/75	✓
8	Kimi K2.5	0	Not run	42	0/75	✗

Per-model results

Each model's submission with screenshots, an on-demand live preview of the copied app, the harness probe breakdown and key metrics.

GLM 5.2

Tier Full
Agent 88/100
Auto 75/75
Contract ✓

Strong retrieval-first grounding with LLM polish, fact validation, real streaming, and field-level citations; minor gaps are no in-flight cancel and only webgpu/unavailable tiers (no WASM/CPU path).

Screenshots: Desktop · Light, Desktop · Dark, Mobile.

Probe breakdown — automated 75/75

Probe	Detail	Points	Result
Harness hook present & well-formed	__ask returns { answer, sources? }	6/6	passed
In-scope factual answers grounded in catalog	6/6 in-scope factual correct	16/16	passed
Multi-fact reasoning answers	2/2 multi-fact correct	8/8	passed
Refuses out-of-scope questions	2/2 out-of-scope refused	7/7	passed
Refuses adversarial / fabrication bait	2/2 adversarial refused	7/7	passed
Answers cite their catalog source	8/8 in-scope answers cited a source	5/5	passed
On-device model initialization tier	reported tier=webgpu, engine=web-llm@0.2.84/Qwen2.5-0.5B-Instruct-q4f32_1-MLC, navigator.gpu=true	10/10	passed
Latency within budget (TTFT + tokens/sec)	ttftMs=1 (budget 30000), tokensPerSec=75	4/4	passed
No off-allowlist traffic after load	no off-allowlist hosts	6/6	passed
Robust to hostile input (empty / very long / rapid-fire)	empty=true longInput=true rapidFire=true	6/6	passed

Key metrics

Time to first token (ms)	1
Tokens / sec	75
Model load (ms)	4024
Network requests	17
Off-allowlist requests	0
Console errors	0
Q&A total	12
Q&A passed	12

Claude Opus 4.8 (high)

Tier Full
Agent 91/100
Auto 75/75
Contract ✓

A trust-first dual-layer design (deterministic catalog retrieval with optional verified LLM paraphrase) delivers visible token streaming, per-answer citations, and strong grounding guards. Minor gaps: no cancel during generation and the backend label can read WebGPU even when answers are served from the catalog fallback.

Screenshots: Desktop · Light, Desktop · Dark, Mobile.

Probe breakdown — automated 75/75

Probe	Detail	Points	Result
Harness hook present & well-formed	__ask returns { answer, sources? }	6/6	passed
In-scope factual answers grounded in catalog	6/6 in-scope factual correct	16/16	passed
Multi-fact reasoning answers	2/2 multi-fact correct	8/8	passed
Refuses out-of-scope questions	2/2 out-of-scope refused	7/7	passed
Refuses adversarial / fabrication bait	2/2 adversarial refused	7/7	passed
Answers cite their catalog source	8/8 in-scope answers cited a source	5/5	passed
On-device model initialization tier	reported tier=webgpu, engine=web-llm@0.2.84 · Qwen2.5-0.5B-Instruct-q4f32_1-MLC, navigator.gpu=true	10/10	passed
Latency within budget (TTFT + tokens/sec)	ttftMs=1 (budget 30000), tokensPerSec=42000	4/4	passed
No off-allowlist traffic after load	no off-allowlist hosts	6/6	passed
Robust to hostile input (empty / very long / rapid-fire)	empty=true longInput=true rapidFire=true	6/6	passed

Key metrics

Time to first token (ms)	1
Tokens / sec	42000
Model load (ms)	3159
Network requests	17
Off-allowlist requests	0
Console errors	0
Q&A total	12
Q&A passed	12

GPT-5.5

Tier Full
Agent 74/100
Auto 75/75
Contract ✓

Polished catalog-grounded Q&A with field-level citations, simulated token streaming, and strong refusal handling, but the UI advertises a WebGPU/Qwen inference backend while answers are produced by a deterministic rules engine. No in-flight cancel control, though loading, fallback, and error paths are handled cleanly.

Screenshots: Desktop · Light, Desktop · Dark, Mobile.

Probe breakdown — automated 75/75

Probe	Detail	Points	Result
Harness hook present & well-formed	__ask returns { answer, sources? }	6/6	passed
In-scope factual answers grounded in catalog	6/6 in-scope factual correct	16/16	passed
Multi-fact reasoning answers	2/2 multi-fact correct	8/8	passed
Refuses out-of-scope questions	2/2 out-of-scope refused	7/7	passed
Refuses adversarial / fabrication bait	2/2 adversarial refused	7/7	passed
Answers cite their catalog source	8/8 in-scope answers cited a source	5/5	passed
On-device model initialization tier	reported tier=webgpu, engine=WebGPU local grounded QA engine (Qwen2.5-0.5B-Instruct-q4f16_1-MLC compatibility manifest), navigator.gpu=true	10/10	passed
Latency within budget (TTFT + tokens/sec)	ttftMs=1 (budget 30000), tokensPerSec=37000	4/4	passed
No off-allowlist traffic after load	no off-allowlist hosts	6/6	passed
Robust to hostile input (empty / very long / rapid-fire)	empty=true longInput=true rapidFire=true	6/6	passed

Key metrics

Time to first token (ms)	1
Tokens / sec	37000
Model load (ms)	2605
Network requests	4
Off-allowlist requests	0
Console errors	0
Q&A total	12
Q&A passed	12

Cursor Composer 2.5

Tier Full
Agent 78/100
Auto 75/75
Contract ✓

Catalog grounding, citations, refusals, and backend degradation are implemented thoughtfully and clearly in the UI. The main weakness is that the LLM path uses non-streaming inference and then simulates word-by-word output, so tokens are not streamed as produced and TTFT readouts are misleading for that path.
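
The difference between real and simulated streaming comes down to where the clock is read. As a minimal sketch (assuming the runtime exposes generated tokens as an async iterable, which is an assumption and not a statement about any particular library), TTFT is only meaningful when the first timestamp is taken at the moment the first token actually arrives. `measureStream` and `StreamMetrics` are hypothetical names.

```typescript
// Hypothetical helper; not part of any runtime's API.
type StreamMetrics = {
  text: string;
  ttftMs: number | null;       // null when no token was produced
  tokensPerSec: number | null; // null when fewer than two tokens arrived
};

async function measureStream(
  tokens: AsyncIterable<string>,
  onToken: (t: string) => void,
  // Injectable clock; in a browser performance.now() gives finer resolution.
  now: () => number = () => Date.now(),
): Promise<StreamMetrics> {
  const start = now();
  let firstAt: number | null = null;
  let lastAt = start;
  let count = 0;
  let text = "";
  for await (const t of tokens) {
    lastAt = now();
    if (firstAt === null) firstAt = lastAt; // the first token really arrived here
    count += 1;
    text += t;
    onToken(t); // append to the UI as each token is produced
  }
  const ttftMs = firstAt === null ? null : firstAt - start;
  // Rate over the decode phase only (first token to last token).
  const decodeMs = firstAt === null ? 0 : lastAt - firstAt;
  const tokensPerSec =
    count > 1 && decodeMs > 0 ? ((count - 1) * 1000) / decodeMs : null;
  return { text, ttftMs, tokensPerSec };
}
```

If the full answer is generated first and then replayed word by word, the same function would report the replay timer's cadence instead of the model's, which is why readouts on that path are misleading.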

Screenshots: Desktop · Light, Desktop · Dark, Mobile.

Probe breakdown — automated 75/75

Probe	Detail	Points	Result
Harness hook present & well-formed	__ask returns { answer, sources? }	6/6	passed
In-scope factual answers grounded in catalog	6/6 in-scope factual correct	16/16	passed
Multi-fact reasoning answers	2/2 multi-fact correct	8/8	passed
Refuses out-of-scope questions	2/2 out-of-scope refused	7/7	passed
Refuses adversarial / fabrication bait	2/2 adversarial refused	7/7	passed
Answers cite their catalog source	8/8 in-scope answers cited a source	5/5	passed
On-device model initialization tier	reported tier=webgpu, engine=web-llm@0.2.84/Qwen2.5-0.5B-Instruct-q4f32_1-MLC, navigator.gpu=true	10/10	passed
Latency within budget (TTFT + tokens/sec)	ttftMs=1 (budget 30000), tokensPerSec=57	4/4	passed
No off-allowlist traffic after load	no off-allowlist hosts	6/6	passed
Robust to hostile input (empty / very long / rapid-fire)	empty=true longInput=true rapidFire=true	6/6	passed

Key metrics

Time to first token (ms)	1
Tokens / sec	57
Model load (ms)	1496
Network requests	17
Off-allowlist requests	0
Console errors	0
Q&A total	12
Q&A passed	12

Grok Build 0.1

Score 87.5

Tier High
Agent 85/100
Auto 64.5/75
Contract ✓

Catalog-first retrieval with field-level citations, adversarial refusals, and a polished status panel is implemented thoughtfully, and the chat UI streams answers incrementally (simulated when the on-device model is unavailable). The main gaps are no cancel during generation and model init falling back to catalog-only despite WebGPU being present.

Screenshots: Desktop · Light, Desktop · Dark, Mobile.

Probe breakdown — automated 64.5/75

Probe	Detail	Points	Result
Harness hook present & well-formed	__ask returns { answer, sources? }	6/6	passed
In-scope factual answers grounded in catalog	6/6 in-scope factual correct	16/16	passed
Multi-fact reasoning answers	2/2 multi-fact correct	8/8	passed
Refuses out-of-scope questions	1/2 out-of-scope refused	3.5/7	failed
Refuses adversarial / fabrication bait	2/2 adversarial refused	7/7	passed
Answers cite their catalog source	8/8 in-scope answers cited a source	5/5	passed
On-device model initialization tier	reported tier=unavailable, engine=catalog-retrieval, navigator.gpu=true	3/10	failed
Latency within budget (TTFT + tokens/sec)	ttftMs=1 (budget 30000), tokensPerSec=76	4/4	passed
No off-allowlist traffic after load	no off-allowlist hosts	6/6	passed
Robust to hostile input (empty / very long / rapid-fire)	empty=true longInput=true rapidFire=true	6/6	passed
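
The automated total appears to be the plain sum of the per-probe points: the figures in this breakdown add up to the reported 64.5/75. A short check, with the numbers copied from the rows above and hypothetical short labels standing in for the probe names:

```typescript
// Points copied from the probe breakdown above; the short labels are illustrative.
type Probe = { name: string; awarded: number; max: number };

const probes: Probe[] = [
  { name: "hook", awarded: 6, max: 6 },
  { name: "in-scope factual", awarded: 16, max: 16 },
  { name: "multi-fact", awarded: 8, max: 8 },
  { name: "out-of-scope refusal", awarded: 3.5, max: 7 },
  { name: "adversarial refusal", awarded: 7, max: 7 },
  { name: "citations", awarded: 5, max: 5 },
  { name: "init tier", awarded: 3, max: 10 },
  { name: "latency", awarded: 4, max: 4 },
  { name: "network hygiene", awarded: 6, max: 6 },
  { name: "robustness", awarded: 6, max: 6 },
];

const awarded = probes.reduce((sum, p) => sum + p.awarded, 0);
const max = probes.reduce((sum, p) => sum + p.max, 0);

console.log(`${awarded}/${max}`); // prints "64.5/75"
```

The two shortfalls (3.5 points on out-of-scope refusal and 7 points on the init tier) account for the full 10.5-point gap from 75.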

Key metrics

Time to first token (ms)	1
Tokens / sec	76
Model load (ms)	449
Network requests	8
Off-allowlist requests	0
Console errors	1
Q&A total	12
Q&A passed	11

Claude Sonnet 4.6 (high)

Tier Mid
Agent 91/100
Auto 37/75
Contract ✓

A polished, production-quality assistant with real WebLLM token streaming, live TTFT/tokens-per-sec readouts, and visible field-level citations backed by a thorough deterministic fallback engine. Input is disabled during generation and there is no cancel control, but loading, WebGPU-unavailable, and error paths all degrade gracefully without dead-ends.

Screenshots: Desktop · Light, Desktop · Dark, Mobile.

Probe breakdown — automated 37/75

Probe	Detail	Points	Result
Harness hook present & well-formed	__ask returns { answer, sources? }	6/6	passed
In-scope factual answers grounded in catalog	3/6 in-scope factual correct	8/16	failed
Multi-fact reasoning answers	0/2 multi-fact correct	0/8	failed
Refuses out-of-scope questions	0/2 out-of-scope refused	0/7	failed
Refuses adversarial / fabrication bait	2/2 adversarial refused	7/7	passed
Answers cite their catalog source	8/8 in-scope answers cited a source	5/5	passed
On-device model initialization tier	reported tier=unavailable, engine=deterministic-lookup (WebLLM failed: [Invalid ShaderModule (unlabeled)] is invalid due to a previous error. - While validating compute s), navigator.gpu=true	3/10	failed
Latency within budget (TTFT + tokens/sec)	ttftMs=5 (budget 30000), tokensPerSec=n/a	2/4	failed
No off-allowlist traffic after load	off-allowlist hosts: us.aws.cdn.hf.co	0/6	failed
Robust to hostile input (empty / very long / rapid-fire)	empty=true longInput=true rapidFire=true	6/6	passed

Key metrics

Time to first token (ms)	5
Model load (ms)	6676
Network requests	28
Off-allowlist requests	0
Console errors	2
Q&A total	12
Q&A passed	5

Gemini 3.1 Pro

Tier Low
Agent 76/100
Auto 26/75
Contract ✓

The WebGPU path streams tokens incrementally with strict grounding prompts and post-answer source lines, but the keyword fallback dumps answers in one shot with no metrics, tier reporting is only webgpu/unavailable rather than WASM/CPU, and a catalog load failure leaves the chat permanently disabled.

Screenshots: Desktop · Light, Desktop · Dark, Mobile.

Probe breakdown — automated 26/75

Probe	Detail	Points	Result
Harness hook present & well-formed	__ask returns { answer, sources? }	6/6	passed
In-scope factual answers grounded in catalog	0/6 in-scope factual correct	0/16	failed
Multi-fact reasoning answers	0/2 multi-fact correct	0/8	failed
Refuses out-of-scope questions	0/2 out-of-scope refused	0/7	failed
Refuses adversarial / fabrication bait	0/2 adversarial refused	0/7	failed
Answers cite their catalog source	8/8 in-scope answers cited a source	5/5	passed
On-device model initialization tier	reported tier=unavailable, engine=fallback-search, navigator.gpu=true	3/10	failed
Latency within budget (TTFT + tokens/sec)	ttftMs=n/a (budget 30000), tokensPerSec=n/a	0/4	failed
No off-allowlist traffic after load	no off-allowlist hosts	6/6	passed
Robust to hostile input (empty / very long / rapid-fire)	empty=true longInput=true rapidFire=true	6/6	passed

Key metrics

Model load (ms)	577
Network requests	3
Off-allowlist requests	0
Console errors	1
Q&A total	12
Q&A passed	0

Kimi K2.5

Tier Not run
Agent 42/100
Auto 0/75
Contract ✗

The deliverable ships a polished chat UI with catalog chunk retrieval, visible source tags, and simulated word-by-word streaming, but answers are rule-templated rather than generated by an on-device instruct LLM. Init can hang on embedding download/computation (screenshot stuck loading; harness never saw __ask), and the WebGPU badge overstates what actually runs client-side.

Screenshots: Desktop · Light, Desktop · Dark, Mobile.

Probe breakdown — automated 0/75

Probe	Detail	Points	Result
Harness hook present & well-formed	window.__ask not a function	0/6	failed
