Task D · On-device ML & Continual Learning · version v1 · weight 20 · static-app

In-browser Fashion Fit Estimation

Frozen task prompt

Build a static web app that estimates clothing fit from a person's body photo — and gets smarter
every time a customer returns an item. You have ONE attempt: a single agentic run, no follow-up
questions, no second chances. You are judged on whether it actually works under an automated
harness, on the accuracy of your estimates, on whether it genuinely learns from returns, on
respecting the user's privacy absolutely, and on the craft of communicating an uncertain,
probabilistic result to a real human. Treat the requirements as the floor. Exceed them.

THE PRODUCT:
From a user-uploaded BODY PHOTO plus a HEIGHT input (for real-world scale), estimate five
fashion-relevant garment metrics — chest, waist, hip (circumferences, cm), inseam and shoulder
(linear, cm) — and recommend a SIZE from a provided brand size chart. Show the user how confident
you are. This is a measuring tape that runs in a browser tab; treat it with the seriousness of
one.

ABSOLUTE PRIVACY (non-negotiable, and verified):
- The photo MUST be processed fully ON-DEVICE and MUST NEVER leave the browser — no uploads, no
  analytics beacons, no "just the embedding". The harness records all network traffic; ANY image
  egress, or any request to a host outside the allowlist, is both a privacy failure and a
  contract violation that zeroes the run.
- The no-external-libraries rule is RELAXED for this task ONLY so you can run a real model: you
  MAY download ONE pinned pose/segmentation model + its runtime from the allowlisted hosts
  (`cdn.jsdelivr.net`, `storage.googleapis.com`) and you MUST record exactly what you pinned in
  `fixtures/sources.lock` (URL + version). After the model has loaded, the app MUST work fully
  OFFLINE. WebGPU is the best tier; a WASM/CPU fallback is a fully valid tier (no GPU is
  assumed). Detect what is available and degrade gracefully — never crash, never hang.
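Editorial sketch (not part of the frozen prompt): one way a submission might guard its own requests against the egress rule above is to check every URL against the two allowlisted hosts before fetching it. The name `isAllowedRequest`, the `appOrigin` parameter, and the decision to treat same-origin requests (such as `/fixtures/`) as permitted are assumptions made for illustration; the source does not describe how the harness itself performs this check.

```javascript
// Hosts named in the prompt's allowlist.
const ALLOWLIST = new Set(["cdn.jsdelivr.net", "storage.googleapis.com"]);

// Hypothetical guard: returns true only for same-origin URLs or
// HTTPS URLs whose hostname is exactly one of the allowlisted hosts.
function isAllowedRequest(url, appOrigin) {
  let parsed;
  try {
    parsed = new URL(url, appOrigin); // resolves relative paths like "/fixtures/..."
  } catch {
    return false; // unparseable URLs are refused
  }
  if (parsed.origin === new URL(appOrigin).origin) return true;
  return parsed.protocol === "https:" && ALLOWLIST.has(parsed.hostname);
}
```

Matching on the exact `hostname` rather than a substring means a look-alike such as `cdn.jsdelivr.net.evil.example` is refused.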

LEARN FROM RETURNS (the hard part — this is the core of the task):
A size chart is a guess; real fit is learned from outcomes. You will be fed a stream of labeled
return events ("ordered M, returned too small"). Use them to update your size recommender so that
your error on a held-out set of customers measurably DROPS. The brand's true fit may differ
systematically from its label sizing — discover and correct that offset from the data. This must
be a real, measured improvement, not a cosmetic "we learn!" badge.
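Editorial sketch (not part of the frozen prompt): the "systematic offset" idea above can be illustrated with a running mean of how many size steps the true size sits from the chart-based size. The class name `OffsetLearner`, the `chartSizeOf` callback (mapping an `imageId` to the size the chart alone would recommend), and the single global offset are all assumptions; a real submission might learn per-size or per-measurement corrections instead.

```javascript
// Hypothetical learner: tracks the mean signed gap (in size steps) between
// the chart-based size and the true size reported in return events.
class OffsetLearner {
  constructor(order) {
    this.order = order; // the chart's `order` array of size strings
    this.sum = 0;
    this.n = 0;
  }

  // batch: [{ imageId, orderedSize, outcome, trueSize }, ...]
  update(batch, chartSizeOf) {
    for (const ev of batch) {
      const base = this.order.indexOf(chartSizeOf(ev.imageId));
      const truth = this.order.indexOf(ev.trueSize);
      if (base < 0 || truth < 0) continue; // skip events with unknown sizes
      this.sum += truth - base;
      this.n += 1;
    }
  }

  // Shift a chart-based size by the rounded mean offset, clamped to the chart.
  adjust(chartSize) {
    const base = this.order.indexOf(chartSize);
    if (base < 0) return chartSize;
    const shift = this.n ? Math.round(this.sum / this.n) : 0;
    const idx = Math.min(this.order.length - 1, Math.max(0, base + shift));
    return this.order[idx];
  }
}
```

Because `update` only accumulates a sum and a count, it can be called repeatedly as batches stream in, which matches the "may be called more than once" wording of the `__applyReturns` hook.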

ROBUST INPUT HANDLING:
Handle a photo with no detectable person, a corrupt/garbage image, and a file that is not an
image at all — each gracefully and distinctly, with a clear message and zero crashes. Never
fabricate a measurement for an input you cannot actually read.
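Editorial sketch (not part of the frozen prompt): the three failure cases above can be kept distinct by checking them in order: file signature first, then decode, then person detection. The helper names and the `decodedOk` / `personFound` flags are assumptions standing in for whatever decoder and model a submission actually uses. Note the limitation that a corrupt file whose header bytes are also damaged would be reported here as `non_image`.

```javascript
// Identify common image formats by their leading signature bytes.
function sniffImageType(bytes) {
  const startsWith = (sig, off = 0) => sig.every((b, i) => bytes[off + i] === b);
  if (startsWith([0xff, 0xd8, 0xff])) return "jpeg";
  if (startsWith([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "png";
  if (startsWith([0x47, 0x49, 0x46, 0x38])) return "gif"; // "GIF8"
  if (startsWith([0x52, 0x49, 0x46, 0x46]) && startsWith([0x57, 0x45, 0x42, 0x50], 8)) {
    return "webp"; // "RIFF" .... "WEBP"
  }
  return null;
}

// Hypothetical classifier: returns one of the contract's error codes,
// or null when the input is usable.
function classifyFailure(bytes, decodedOk, personFound) {
  if (sniffImageType(bytes) === null) return "non_image";
  if (!decodedOk) return "bad_image";
  if (!personFound) return "no_person";
  return null;
}
```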

PROBABILISTIC-OUTPUT UX (judged):
- Communicate uncertainty honestly: a confidence indicator and plausible ranges (not just a hard
  single number), so a user understands an estimate is an estimate.
- Explain "why this size" — make the recommendation legible (which measurements drove it, how a
  return-informed adjustment changed it).
- Earn trust: explicit, up-front consent / privacy messaging that the photo stays on the device.

MANDATORY HARNESS HOOKS (exact contract — the evaluator depends on these verbatim):
At eval time the harness serves your app at `/` and the task fixtures at `/fixtures/` on the SAME
origin. It will NOT click your UI; it calls these global hooks on `window`:

1) `window.__estimate(imageId)` → `Promise<{ measurements: { chest, waist, hip, inseam, shoulder },
   size, confidence }>`
   - `imageId` is an entry in `/fixtures/bodies/manifest.json` (each entry:
     `{ id, file, type, heightCm }`). Resolve it, load `/fixtures/bodies/<file>` ON-DEVICE, and use
     that entry's `heightCm` as the height/scale input (in the live UI a human types this).
   - `measurements` are numbers in cm; `size` is one of the strings in the chart's `order`;
     `confidence` is a number in `[0,1]`.
   - For a no-person / unreadable / non-image input, DO NOT reject. RESOLVE with
     `{ measurements: null, size: null, confidence: 0, error: "no_person" | "bad_image" |
     "non_image" }`.

2) `window.__applyReturns(batch)` → `Promise<void>`
   - `batch` is an array of labeled return events: `{ imageId, orderedSize, outcome:
     "too_small" | "too_large" | "fit", trueSize }`. Ingest them and update your recommender so
     subsequent `__estimate` calls return better-fitting sizes. May be called more than once
     (treat it as a stream).

3) `window.__modelInfo` (optional but recommended) → `{ backend: "webgpu"|"wasm"|"cpu"|"stub",
   model: <pinned url or name> }`. Declare your runtime tier and model honestly; it informs tiering.

4) Signal readiness: expose `window.__ready` as a Promise that resolves once the model is loaded
   and the hooks are callable (or set `window.__fitReady = true`). The harness awaits this before
   calling `__estimate`.
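Editorial sketch (not part of the frozen prompt): the "DO NOT reject" rule for `__estimate` can be met by wrapping the real pipeline so that every failure resolves with the error-shaped result. `makeEstimateHook`, the `pipeline` argument, and the convention that the pipeline throws errors carrying a `code` property are assumptions for illustration, as is the fallback to `"bad_image"` for unrecognised errors.

```javascript
const ERROR_CODES = new Set(["no_person", "bad_image", "non_image"]);

// Hypothetical wrapper: turns any pipeline into a hook that always resolves.
function makeEstimateHook(pipeline) {
  return async function estimate(imageId) {
    try {
      const r = await pipeline(imageId);
      // Keep confidence inside [0, 1] as the contract requires.
      const c = Number.isFinite(r.confidence) ? Math.min(1, Math.max(0, r.confidence)) : 0;
      return { measurements: r.measurements, size: r.size, confidence: c };
    } catch (err) {
      // Assumption: unknown failures are reported as "bad_image".
      const code = ERROR_CODES.has(err && err.code) ? err.code : "bad_image";
      return { measurements: null, size: null, confidence: 0, error: code };
    }
  };
}
```

In a browser, the result would be assigned to `window.__estimate`, with `window.__ready` set to the promise returned by the model loader.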

The provided brand size chart is at `/fixtures/size-chart.json` (its `order` array lists the valid
size strings). Base recommendations on it.

OUTPUT CONTRACT (mandatory):
- One runnable, static web app in the current working directory; entry point is `index.html` in
  the root. Your own CSS/JS (separate files fine, relatively linked). No build step — it must run
  by serving the folder statically.
- A REAL, accessible UI for humans (photo upload + height input + results with confidence/ranges
  and the "why this size" explanation), in addition to the harness hooks.
- Besides the ONE pinned model/runtime from the allowlisted hosts, NO other external libraries,
  CSS frameworks or CDN dependencies. After model load, fully offline.

There are no follow-up questions and only a single run. Start implementing immediately.

Task score (0–100)

GLM 5.2	99
Claude Opus 4.8 (high)	99
GPT-5.5	97
Cursor Composer 2.5	97
Grok Build 0.1	54
Claude Sonnet 4.6 (high)	48
Kimi K2.5	42
Gemini 3.1 Pro	42

Leaderboard

#	Model	Score	Tier	Agent	Auto	Contract
1	GLM 5.2	99	Full	87	70/70	✓
2	Claude Opus 4.8 (high)	99	Full	92	70/70	✓
3	GPT-5.5	97	Full	88	70/70	✓
4	Cursor Composer 2.5	97	Full	89	70/70	✓
5	Grok Build 0.1	54	Low	86	27/70	✓
6	Claude Sonnet 4.6 (high)	48	Low	88	20/70	✓
7	Kimi K2.5	42	Low	58	20/70	✓
8	Gemini 3.1 Pro	42	Low	74	20/70	✓

Per-model results

Each model's submission with screenshots, an on-demand live preview of the copied app, the harness probe breakdown and key metrics.

GLM 5.2

Tier Full
Agent 87/100
Auto 70/70
Contract ✓

The deliverable is unusually thorough: honest confidence scoring with bar and verdict, measurement tables with confidence-scaled ranges, a clear “Why this size” driver breakdown, and explicit return-feedback adjustment copy. On-device privacy is repeated consistently; consent and delete controls are present though the checkbox defaults to checked.

Screenshots: Desktop · Light, Desktop · Dark, Mobile

Probe breakdown — automated 70/70

Probe	Detail	Points	Result
Loads & produces well-formed metrics	well-formed { measurements, size, confidence }	8/8	passed
Real on-device model + runtime loaded	real model (external=true, backend=webgpu)	6/6	passed
Measurement accuracy vs gold (MAE)	meanMAE=1.14cm per-metric={"chest":1.08,"waist":2.17,"hip":1.58,"inseam":0.51,"shoulder":0.35}	16/16	passed
Recommended size top-1 accuracy	top1=8/8	10/10	passed
Recommended size within ±1	within1=8/8	6/6	passed
Online learning lowers holdout error (CORE)	holdout err 1 -> 0	14/14	passed
Graceful no-person / bad-image / non-image	noperson:no_person✓ bad:bad_image✓ notimage:non_image✓	4/4	passed
No image egress / on-device only	offAllowlist=0 bigUploadsAfterEstimate=0	6/6	passed
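The reported `meanMAE` is consistent with an unweighted mean of the five per-metric values. The sketch below shows that arithmetic, plus a conventional per-metric MAE over paired predictions and gold values; the function names are illustrative, and the harness's actual scoring code is not shown in the source.

```javascript
// Mean absolute error for one metric over paired predictions and gold values (cm).
function mae(pred, gold) {
  if (pred.length !== gold.length || pred.length === 0) {
    throw new Error("pred and gold must be non-empty and the same length");
  }
  let sum = 0;
  for (let i = 0; i < pred.length; i++) sum += Math.abs(pred[i] - gold[i]);
  return sum / pred.length;
}

// Unweighted mean across the per-metric MAEs.
function meanMae(perMetric) {
  const values = Object.values(perMetric);
  return values.reduce((a, b) => a + b, 0) / values.length;
}

// Per-metric values copied from the GLM 5.2 probe row.
const glm = { chest: 1.08, waist: 2.17, hip: 1.58, inseam: 0.51, shoulder: 0.35 };
console.log(meanMae(glm).toFixed(2)); // prints 1.14, matching the reported meanMAE
```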

Key metrics

Metric	Value
Model load (ms)	2258
Estimate latency (ms)	4
Mean abs. error (cm)	1.14
MAE chest (cm)	1.08
MAE waist (cm)	2.17
MAE hip (cm)	1.58
MAE inseam (cm)	0.51
MAE shoulder (cm)	0.35
Size top-1	1
Holdout error (before)	1
Holdout error (after)	0
Off-allowlist requests	0

Claude Opus 4.8 (high)

Tier Full
Agent 92/100
Auto 70/70
Contract ✓

Outstanding probabilistic UX with labeled confidence, per-metric ranges, size-distribution bars, and explicit flags for low-confidence or borderline inputs. Privacy and opt-in consent are credible and repeated; chest-driven sizing and return-adjustment explanations are clear, with a measured holdout before/after learning demo.

Screenshots: Desktop · Light, Desktop · Dark, Mobile

Probe breakdown — automated 70/70

Probe	Detail	Points	Result
Loads & produces well-formed metrics	well-formed { measurements, size, confidence }	8/8	passed
Real on-device model + runtime loaded	real model (external=true, backend=webgpu)	6/6	passed
Measurement accuracy vs gold (MAE)	meanMAE=1.22cm per-metric={"chest":2.2,"waist":1.57,"hip":1.58,"inseam":0.28,"shoulder":0.45}	16/16	passed
Recommended size top-1 accuracy	top1=8/8	10/10	passed
Recommended size within ±1	within1=8/8	6/6	passed
Online learning lowers holdout error (CORE)	holdout err 1 -> 0	14/14	passed
Graceful no-person / bad-image / non-image	noperson:no_person✓ bad:bad_image✓ notimage:non_image✓	4/4	passed
No image egress / on-device only	offAllowlist=0 bigUploadsAfterEstimate=0	6/6	passed

Key metrics

Metric	Value
Model load (ms)	1598
Estimate latency (ms)	103
Mean abs. error (cm)	1.22
MAE chest (cm)	2.2
MAE waist (cm)	1.57
MAE hip (cm)	1.58
MAE inseam (cm)	0.28
MAE shoulder (cm)	0.45
Size top-1	1
Holdout error (before)	1
Holdout error (after)	0
Off-allowlist requests	0

GPT-5.5

Tier Full
Agent 88/100
Auto 70/70
Contract ✓

Strong probabilistic UX with a confidence meter, per-metric plausible ranges, and clear chest-driven sizing plus explicit return-adjustment copy. On-device privacy is repeated credibly; gaps are no dedicated low-confidence warning for borderline successes and no explicit delete/revoke control beyond re-upload.

Screenshots: Desktop · Light, Desktop · Dark, Mobile

Probe breakdown — automated 70/70

Probe	Detail	Points	Result
Loads & produces well-formed metrics	well-formed { measurements, size, confidence }	8/8	passed
Real on-device model + runtime loaded	real model (external=false, backend=webgpu)	6/6	passed
Measurement accuracy vs gold (MAE)	meanMAE=1.24cm per-metric={"chest":1.08,"waist":2.17,"hip":1.58,"inseam":0.91,"shoulder":0.45}	16/16	passed
Recommended size top-1 accuracy	top1=8/8	10/10	passed
Recommended size within ±1	within1=8/8	6/6	passed
Online learning lowers holdout error (CORE)	holdout err 1 -> 0	14/14	passed
Graceful no-person / bad-image / non-image	noperson:no_person✓ bad:bad_image✓ notimage:non_image✓	4/4	passed
No image egress / on-device only	offAllowlist=0 bigUploadsAfterEstimate=0	6/6	passed

Key metrics

Metric	Value
Model load (ms)	69
Estimate latency (ms)	4
Mean abs. error (cm)	1.24
MAE chest (cm)	1.08
MAE waist (cm)	2.17
MAE hip (cm)	1.58
MAE inseam (cm)	0.91
MAE shoulder (cm)	0.45
Size top-1	1
Holdout error (before)	1
Holdout error (after)	0
Off-allowlist requests	0

Cursor Composer 2.5

Tier Full
Agent 89/100
Auto 70/70
Contract ✓

Strong probabilistic UX with a confidence meter, per-metric likely ranges, and a clear chest-driven 'why this size' narrative plus return-adjustment copy. Privacy messaging is repeated and credible; minor gaps are no explicit low-confidence warning state and a pre-checked consent checkbox.

Screenshots: Desktop · Light, Desktop · Dark, Mobile

Probe breakdown — automated 70/70

Probe	Detail	Points	Result
Loads & produces well-formed metrics	well-formed { measurements, size, confidence }	8/8	passed
Real on-device model + runtime loaded	real model (external=true, backend=webgpu)	6/6	passed
Measurement accuracy vs gold (MAE)	meanMAE=1.09cm per-metric={"chest":1.08,"waist":2.17,"hip":1.58,"inseam":0.28,"shoulder":0.35}	16/16	passed
Recommended size top-1 accuracy	top1=8/8	10/10	passed
Recommended size within ±1	within1=8/8	6/6	passed
Online learning lowers holdout error (CORE)	holdout err 1 -> 0	14/14	passed
Graceful no-person / bad-image / non-image	noperson:no_person✓ bad:bad_image✓ notimage:non_image✓	4/4	passed
No image egress / on-device only	offAllowlist=0 bigUploadsAfterEstimate=0	6/6	passed

Key metrics

Metric	Value
Model load (ms)	2549
Estimate latency (ms)	105
Mean abs. error (cm)	1.09
MAE chest (cm)	1.08
MAE waist (cm)	2.17
MAE hip (cm)	1.58
MAE inseam (cm)	0.28
MAE shoulder (cm)	0.35
Size top-1	1
Holdout error (before)	1
Holdout error (after)	0
Off-allowlist requests	0

Grok Build 0.1

Tier Low
Agent 86/100
Auto 27/70
Contract ✓

The app communicates uncertainty well via a confidence bar, confidence-scaled measurement ranges, and explicit copy to trust ranges over point values. Return-feedback adjustment is explained clearly when applied, on-device privacy is repeated throughout, and the implementation goes beyond the brief with MediaPipe pose detection, silhouette fallback, and accessible UI controls—though low-confidence estimates lack a dedicated warning and only chest is named as the sizing driver.

Screenshots: Desktop · Light, Desktop · Dark, Mobile

Probe breakdown — automated 27/70

Probe	Detail	Points	Result
Loads & produces well-formed metrics	well-formed { measurements, size, confidence }	8/8	passed
Real on-device model + runtime loaded	real model (external=true, backend=webgpu)	6/6	passed
Measurement accuracy vs gold (MAE)	meanMAE=20.99cm per-metric={"chest":38.57,"waist":23.51,"hip":11.55,"inseam":20.65,"shoulder":10.65}	0/16	failed
Recommended size top-1 accuracy	top1=1/8	1/10	failed
Recommended size within ±1	within1=2/8	2/6	failed
Online learning lowers holdout error (CORE)	holdout err 1 -> 1	0/14	failed
Graceful no-person / bad-image / non-image	noperson:no_person✓ bad:bad_image✓ notimage:non_image✓	4/4	passed
No image egress / on-device only	offAllowlist=0 bigUploadsAfterEstimate=0	6/6	passed

Key metrics

Metric	Value
Model load (ms)	2030
Estimate latency (ms)	105
Mean abs. error (cm)	20.99
MAE chest (cm)	38.57
MAE waist (cm)	23.51
MAE hip (cm)	11.55
MAE inseam (cm)	20.65
MAE shoulder (cm)	10.65
Size top-1	0.13
Holdout error (before)	1
Holdout error (after)	1
Off-allowlist requests	0

Claude Sonnet 4.6 (high)

Tier Low
Agent 88/100
Auto 20/70
Contract ✓

Strong probabilistic UX with a confidence bar, tiered confidence copy, per-metric ranges, and a full size probability chart, plus explicit return-adjustment messaging. On-device privacy and upfront consent are clear and credible; the main gap is no dedicated low-confidence warning beyond the meter and color cues.

Screenshots: Desktop · Light, Desktop · Dark, Mobile

Probe breakdown — automated 20/70

Probe	Detail	Points	Result
Loads & produces well-formed metrics	not well-formed ({"ok":true,"value":{"measurements":null,"size":null,"confidence":0,"error":"bad_image"},"ms":5})	4/8	failed
Real on-device model + runtime loaded	real model (external=true, backend=webgpu)	6/6	passed
Measurement accuracy vs gold (MAE)	meanMAE=20cm per-metric={"chest":20,"waist":20,"hip":20,"inseam":20,"shoulder":20}	0/16	failed
Recommended size top-1 accuracy	top1=0/8	0/10	failed
Recommended size within ±1	within1=0/8	0/6	failed
Online learning lowers holdout error (CORE)	holdout err 5 -> 5	0/14	failed
Graceful no-person / bad-image / non-image	noperson:bad_image✓ bad:bad_image✓ notimage:bad_image✓	4/4	passed
No image egress / on-device only	offAllowlist=0 bigUploadsAfterEstimate=0	6/6	passed

Key metrics

Metric	Value
Model load (ms)	8345
Estimate latency (ms)	1
Mean abs. error (cm)	20
MAE chest (cm)	20
MAE waist (cm)	20
MAE hip (cm)	20
MAE inseam (cm)	20
MAE shoulder (cm)	20
Size top-1	0
Holdout error (before)	5
Holdout error (after)	5
Off-allowlist requests	0

Kimi K2.5

Tier Low
Agent 58/100
Auto 20/70
Contract ✓

The app communicates uncertainty well via a color-coded confidence meter and per-measurement ± ranges, with strong upfront on-device privacy messaging. Gaps remain: no explicit low-confidence warnings, drivers are prose-only (alternatives unused), and consent relies on a banner rather than an upfront opt-in gate.

Screenshots: Desktop · Light, Desktop · Dark, Mobile

Probe breakdown — automated 20/70

Probe	Detail	Points	Result
Loads & produces well-formed metrics	not well-formed ({"ok":true,"value":{"measurements":null,"size":null,"confidence":0,"error":"bad_image"},"ms":3})	4/8	failed
Real on-device model + runtime loaded	real model (external=true, backend=webgpu)	6/6	passed
Measurement accuracy vs gold (MAE)	meanMAE=20cm per-metric={"chest":20,"waist":20,"hip":20,"inseam":20,"shoulder":20}	0/16	failed
Recommended size top-1 accuracy	top1=0/8	0/10	failed
Recommended size within ±1	within1=0/8	0/6	failed
Online learning lowers holdout error (CORE)	holdout err 5 -> 5	0/14	failed
Graceful no-person / bad-image / non-image	noperson:bad_image✓ bad:bad_image✓ notimage:bad_image✓	4/4	passed
No image egress / on-device only	offAllowlist=0 bigUploadsAfterEstimate=0	6/6	passed

Key metrics

Metric	Value
Model load (ms)	7175
Estimate latency (ms)	3
Mean abs. error (cm)	20
MAE chest (cm)	20
MAE waist (cm)	20
MAE hip (cm)	20
MAE inseam (cm)	20
MAE shoulder (cm)	20
Size top-1	0
Holdout error (before)	5
Holdout error (after)	5
Off-allowlist requests	0
Console errors	18

Gemini 3.1 Pro

Tier Low
Agent 74/100
Auto 20/70
Contract ✓

Uncertainty is communicated well through a confidence bar, low-confidence warning, and confidence-scaled measurement ranges, with strong repeated on-device privacy copy and consent gating. Gaps include a single point size recommendation, generic return-feedback messaging that does not explain direction or magnitude, and only the primary chest metric is named as the sizing driver.

Screenshots: Desktop · Light, Desktop · Dark, Mobile

Probe breakdown — automated 20/70

Probe	Detail	Points	Result
Loads & produces well-formed metrics	not well-formed ({"ok":true,"value":{"measurements":null,"size":null,"confidence":0,"error":"non_image"},"ms":8})	4/8	failed
Real on-device model + runtime loaded	real model (external=true, backend=webgpu)	6/6	passed
Measurement accuracy vs gold (MAE)	meanMAE=20cm per-metric={"chest":20,"waist":20,"hip":20,"inseam":20,"shoulder":20}	0/16	failed
Recommended size top-1 accuracy	top1=0/8	0/10	failed
Recommended size within ±1	within1=0/8	0/6	failed
Online learning lowers holdout error (CORE)	holdout err 5 -> 5	0/14	failed
Graceful no-person / bad-image / non-image	noperson:non_image✓ bad:non_image✓ notimage:non_image✓	4/4	passed
No image egress / on-device only	offAllowlist=0 bigUploadsAfterEstimate=0	6/6	passed

Key metrics

Metric	Value
Model load (ms)	20070
Estimate latency (ms)	4
Mean abs. error (cm)	20
MAE chest (cm)	20
MAE waist (cm)	20
MAE hip (cm)	20
MAE inseam (cm)	20
MAE shoulder (cm)	20
Size top-1	0
Holdout error (before)	5
Holdout error (after)	5
Off-allowlist requests	0
Console errors	19
