Task D · On-device ML & Continual Learning · version v1 · weight 20 · static-app

In-browser Fashion Fit Estimation

Frozen task prompt
Build a static web app that estimates clothing fit from a person's body photo — and gets smarter
every time a customer returns an item. You have ONE attempt: a single agentic run, no follow-up
questions, no second chances. You are judged on whether it actually works under an automated
harness, on the accuracy of your estimates, on whether it genuinely learns from returns, on
respecting the user's privacy absolutely, and on the craft of communicating an uncertain,
probabilistic result to a real human. Treat the requirements as the floor. Exceed them.

THE PRODUCT:
From a user-uploaded BODY PHOTO plus a HEIGHT input (for real-world scale), estimate five
fashion-relevant garment metrics — chest, waist, hip (circumferences, cm), inseam and shoulder
(linear, cm) — and recommend a SIZE from a provided brand size chart. Show the user how confident
you are. This is a measuring tape that runs in a browser tab; treat it with the seriousness of
one.

ABSOLUTE PRIVACY (non-negotiable, and verified):
- The photo MUST be processed fully ON-DEVICE and MUST NEVER leave the browser — no uploads, no
  analytics beacons, no "just the embedding". The harness records all network traffic; ANY image
  egress, or any request to a host outside the allowlist, is both a privacy failure and a
  contract violation that zeroes the run.
- The no-external-libraries rule is RELAXED for this task ONLY so you can run a real model: you
  MAY download ONE pinned pose/segmentation model + its runtime from the allowlisted hosts
  (`cdn.jsdelivr.net`, `storage.googleapis.com`) and you MUST record exactly what you pinned in
  `fixtures/sources.lock` (URL + version). After the model has loaded, the app MUST work fully
  OFFLINE. WebGPU is the best tier; a WASM/CPU fallback is a fully valid tier (no GPU is
  assumed). Detect what is available and degrade gracefully — never crash, never hang.
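
The tier detection described above could be sketched as follows. This is a minimal illustration, not part of the task prompt: `pickBackend`, `withTimeout`, the `env` parameter and the 3000 ms default are all hypothetical names and values chosen for the example. The only platform calls it relies on are `navigator.gpu.requestAdapter()` and `WebAssembly.instantiate`.

```javascript
// Hypothetical helper (not from the task): resolve `promise`, or `fallback` after `ms`.
function withTimeout(promise, ms, fallback) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Pick the best available tier. `env` stands in for the browser globals
// (pass `globalThis` in a real page) so the logic can be exercised anywhere.
async function pickBackend(env, timeoutMs = 3000) {
  try {
    const gpu = env.navigator && env.navigator.gpu;
    if (gpu) {
      // requestAdapter() may resolve to null when no suitable GPU exists.
      const adapter = await withTimeout(gpu.requestAdapter(), timeoutMs, null);
      if (adapter) return "webgpu";
    }
  } catch (err) {
    // A throwing or rejecting WebGPU probe is treated like "no GPU", never as a crash.
  }
  const wasm = env.WebAssembly;
  if (wasm && typeof wasm.instantiate === "function") return "wasm";
  return "cpu";
}
```

The timeout is there so that a probe that never settles degrades to the next tier instead of hanging the page.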

LEARN FROM RETURNS (the hard part — this is the core of the task):
A size chart is a guess; real fit is learned from outcomes. You will be fed a stream of labeled
return events ("ordered M, returned too small"). Use them to update your size recommender so that
your error on a held-out set of customers measurably DROPS. The brand's true fit may differ
systematically from its label sizing — discover and correct that offset from the data. This must
be a real, measured improvement, not a cosmetic "we learn!" badge.
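
One simple way to "discover and correct that offset" is a running mean of the signed size-index error between the chart-based recommendation and the true size. The sketch below assumes exactly that. `createRecommender`, `baselineOf` (a lookup from `imageId` to the chart-only size) and the rounding rule are hypothetical choices for illustration, not the method any submission is known to use.

```javascript
// `order` is the chart's ordered list of size strings, e.g. ["XS","S","M","L","XL"].
function createRecommender(order) {
  let sum = 0; // accumulated (true index - baseline index)
  let n = 0;   // number of usable return events seen so far
  const clamp = (i) => Math.min(order.length - 1, Math.max(0, i));
  const offset = () => (n ? sum / n : 0);

  return {
    // Ingest one batch; safe to call repeatedly, so it behaves like a stream.
    // `baselineOf(imageId)` is an assumed lookup returning the chart-only size.
    applyReturns(batch, baselineOf) {
      for (const ev of batch) {
        const base = order.indexOf(baselineOf(ev.imageId));
        const truth = order.indexOf(ev.trueSize);
        if (base < 0 || truth < 0) continue; // skip events we cannot interpret
        sum += truth - base;
        n += 1;
      }
    },
    // Shift the chart-only size by the learned systematic offset.
    recommend(baselineSize) {
      const base = order.indexOf(baselineSize);
      if (base < 0) return null;
      return order[clamp(base + Math.round(offset()))];
    },
    offset,
  };
}
```

A single global offset only captures a brand-wide bias. Per-size or per-measurement corrections would need more state, and more return events to estimate reliably.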

ROBUST INPUT HANDLING:
Handle a photo with no detectable person, a corrupt/garbage image, and a file that is not an
image at all — each gracefully and distinctly, with a clear message and zero crashes. Never
fabricate a measurement for an input you cannot actually read.
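
To keep the three failure cases distinct, one option is to check the file signature first, then decoding, then person detection. The sketch below assumes that ordering. `sniffImageType` and `classifyFailure` are hypothetical names, and the prompt does not say how the harness fixtures actually differ. In a browser, `decodedOk` could come from whether `createImageBitmap` resolved, and `personFound` from the pose model's output.

```javascript
// Recognise a few common image containers by their leading magic bytes.
function sniffImageType(bytes) {
  const startsWith = (sig, offset = 0) =>
    bytes.length >= offset + sig.length &&
    sig.every((b, i) => bytes[offset + i] === b);
  if (startsWith([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "png";
  if (startsWith([0xff, 0xd8, 0xff])) return "jpeg";
  if (startsWith([0x47, 0x49, 0x46, 0x38])) return "gif"; // "GIF8"
  if (startsWith([0x52, 0x49, 0x46, 0x46]) && startsWith([0x57, 0x45, 0x42, 0x50], 8)) {
    return "webp"; // "RIFF" .... "WEBP"
  }
  return null;
}

// Map what we know about an input to one of the contract's error codes,
// or null when the input is usable. No measurement is produced on failure.
function classifyFailure({ bytes, decodedOk, personFound }) {
  if (sniffImageType(bytes) === null) return "non_image";
  if (!decodedOk) return "bad_image";
  if (!personFound) return "no_person";
  return null;
}
```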

PROBABILISTIC-OUTPUT UX (judged):
- Communicate uncertainty honestly: a confidence indicator and plausible ranges (not just a hard
  single number), so a user understands an estimate is an estimate.
- Explain "why this size" — make the recommendation legible (which measurements drove it, how a
  return-informed adjustment changed it).
- Earn trust: explicit, up-front consent / privacy messaging that the photo stays on the device.
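
A "plausible range" can be derived from a point estimate by widening an interval as confidence falls. The sketch below is one arbitrary way to do that. `plausibleRange`, the 2 cm base half-width and the linear widening rule are assumptions for illustration, not calibrated values and not taken from any submission.

```javascript
// Turn a point estimate (cm) and a confidence in [0,1] into a display range.
// Half-width grows linearly from `baseHalfWidthCm` (confidence 1)
// to twice that (confidence 0).
function plausibleRange(valueCm, confidence, baseHalfWidthCm = 2) {
  const c = Math.min(1, Math.max(0, confidence)); // clamp into [0,1]
  const half = baseHalfWidthCm * (2 - c);
  const round1 = (x) => Math.round(x * 10) / 10;  // one decimal for display
  return { low: round1(valueCm - half), high: round1(valueCm + half) };
}

console.log(plausibleRange(100, 1));   // { low: 98, high: 102 }
console.log(plausibleRange(100, 0.5)); // { low: 97, high: 103 }
```

A production app would ideally calibrate the width against measured error per metric rather than use a fixed base.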

MANDATORY HARNESS HOOKS (exact contract — the evaluator depends on these verbatim):
At eval time the harness serves your app at `/` and the task fixtures at `/fixtures/` on the SAME
origin. It will NOT click your UI; it calls these global hooks on `window`:

1) `window.__estimate(imageId)` → `Promise<{ measurements: { chest, waist, hip, inseam, shoulder },
   size, confidence }>`
   - `imageId` is an entry in `/fixtures/bodies/manifest.json` (each entry:
     `{ id, file, type, heightCm }`). Resolve it, load `/fixtures/bodies/<file>` ON-DEVICE, and use
     that entry's `heightCm` as the height/scale input (in the live UI a human types this).
   - `measurements` are numbers in cm; `size` is one of the strings in the chart's `order`;
     `confidence` is a number in `[0,1]`.
   - For a no-person / unreadable / non-image input, DO NOT reject. RESOLVE with
     `{ measurements: null, size: null, confidence: 0, error: "no_person" | "bad_image" |
     "non_image" }`.

2) `window.__applyReturns(batch)` → `Promise<void>`
   - `batch` is an array of labeled return events: `{ imageId, orderedSize, outcome:
     "too_small" | "too_large" | "fit", trueSize }`. Ingest them and update your recommender so
     subsequent `__estimate` calls return better-fitting sizes. May be called more than once
     (treat it as a stream).

3) `window.__modelInfo` (optional but recommended) → `{ backend: "webgpu"|"wasm"|"cpu"|"stub",
   model: <pinned url or name> }`. Declare your runtime tier and model honestly; it informs tiering.

4) Signal readiness: expose `window.__ready` as a Promise that resolves once the model is loaded
   and the hooks are callable (or set `window.__fitReady = true`). The harness awaits this before
   calling `__estimate`.
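
The hook contract above can be wired up roughly as follows. This is a sketch under stated assumptions: `installHooks`, `failure` and the shape of the `impl` object (`load`, `estimate`, `applyReturns`, `modelInfo`) are hypothetical, and mapping unknown errors to `"bad_image"` is a choice made for the example. The point it illustrates is that `__estimate` always resolves, even on failure.

```javascript
const FAILURE_CODES = ["no_person", "bad_image", "non_image"];

// Build the resolved failure shape required by the contract.
function failure(code) {
  return { measurements: null, size: null, confidence: 0, error: code };
}

// Attach the harness hooks to `target` (pass `window` in a real page).
// `impl` is an assumed object providing the actual model-backed logic.
function installHooks(target, impl) {
  target.__estimate = async (imageId) => {
    try {
      return await impl.estimate(imageId);
    } catch (err) {
      // Never reject: translate any error into a well-formed failure result.
      const known = err && FAILURE_CODES.includes(err.code);
      return failure(known ? err.code : "bad_image");
    }
  };
  target.__applyReturns = async (batch) => {
    await impl.applyReturns(batch);
  };
  target.__modelInfo = impl.modelInfo;
  // Resolves once the model is loaded; load errors propagate to whoever awaits it.
  target.__ready = Promise.resolve()
    .then(() => impl.load())
    .then(() => {
      target.__fitReady = true;
    });
}
```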

The provided brand size chart is at `/fixtures/size-chart.json` (its `order` array lists the valid
size strings). Base recommendations on it.

OUTPUT CONTRACT (mandatory):
- One runnable, static web app in the current working directory; entry point is `index.html` in
  the root. Your own CSS/JS (separate files fine, relatively linked). No build step — it must run
  by serving the folder statically.
- A REAL, accessible UI for humans (photo upload + height input + results with confidence/ranges
  and the "why this size" explanation), in addition to the harness hooks.
- Besides the ONE pinned model/runtime from the allowlisted hosts, NO other external libraries,
  CSS frameworks or CDN dependencies. After model load, fully offline.

There are no follow-up questions and only a single run. Start implementing immediately.

Task score (0–100)

  • GLM 5.2: 99
  • Claude Opus 4.8 (high): 99
  • GPT-5.5: 97
  • Cursor Composer 2.5: 97
  • Grok Build 0.1: 54
  • Claude Sonnet 4.6 (high): 48
  • Kimi K2.5: 42
  • Gemini 3.1 Pro: 42

Leaderboard

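| # | Model | Score | Tier | Agent | Auto | Contract |
|---|-------|-------|------|-------|------|----------|
| 1 | GLM 5.2 | 99 | Full | 87 | 70/70 | ✓ |
| 2 | Claude Opus 4.8 (high) | 99 | Full | 92 | 70/70 | ✓ |
| 3 | GPT-5.5 | 97 | Full | 88 | 70/70 | ✓ |
| 4 | Cursor Composer 2.5 | 97 | Full | 89 | 70/70 | ✓ |
| 5 | Grok Build 0.1 | 54 | Low | 86 | 27/70 | ✓ |
| 6 | Claude Sonnet 4.6 (high) | 48 | Low | 88 | 20/70 | ✓ |
| 7 | Kimi K2.5 | 42 | Low | 58 | 20/70 | ✓ |
| 8 | Gemini 3.1 Pro | 42 | Low | 74 | 20/70 | ✓ |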

Per-model results

Each model's submission with screenshots, an on-demand live preview of the copied app, the harness probe breakdown and key metrics.

#1

GLM 5.2

99
  • Tier Full
  • Agent 87/100
  • Auto 70/70
  • Contract ✓

The deliverable is unusually thorough: honest confidence scoring with bar and verdict, measurement tables with confidence-scaled ranges, a clear “Why this size” driver breakdown, and explicit return-feedback adjustment copy. On-device privacy is repeated consistently; consent and delete controls are present though the checkbox defaults to checked.

[Screenshots: Desktop · Light, Desktop · Dark, Mobile]

Probe breakdown — automated 70/70
  • Loads & produces well-formed metrics: well-formed { measurements, size, confidence } · 8/8 · passed
  • Real on-device model + runtime loaded: real model (external=true, backend=webgpu) · 6/6 · passed
  • Measurement accuracy vs gold (MAE): meanMAE=1.14cm per-metric={"chest":1.08,"waist":2.17,"hip":1.58,"inseam":0.51,"shoulder":0.35} · 16/16 · passed
  • Recommended size top-1 accuracy: top1=8/8 · 10/10 · passed
  • Recommended size within ±1: within1=8/8 · 6/6 · passed
  • Online learning lowers holdout error (CORE): holdout err 1 -> 0 · 14/14 · passed
  • Graceful no-person / bad-image / non-image: noperson:no_person✓ bad:bad_image✓ notimage:non_image✓ · 4/4 · passed
  • No image egress / on-device only: offAllowlist=0 bigUploadsAfterEstimate=0 · 6/6 · passed
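
The reported `meanMAE` is consistent with an unweighted mean of the five per-metric MAE values, rounded to two decimals. That reading is inferred from the numbers on this page; the harness's actual formula is not shown. The helper name `meanOfPerMetricMae` is hypothetical.

```javascript
// Average the per-metric MAE values and round to two decimals.
function meanOfPerMetricMae(perMetric) {
  const values = Object.values(perMetric);
  const sum = values.reduce((acc, v) => acc + v, 0);
  return Number((sum / values.length).toFixed(2));
}

// Per-metric values reported for GLM 5.2 in the MAE probe above.
const glm = { chest: 1.08, waist: 2.17, hip: 1.58, inseam: 0.51, shoulder: 0.35 };
console.log(meanOfPerMetricMae(glm)); // 1.14
```

Here (1.08 + 2.17 + 1.58 + 0.51 + 0.35) / 5 = 5.69 / 5 = 1.138, which rounds to the reported 1.14 cm.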

Key metrics

  • Model load (ms): 2258
  • Estimate latency (ms): 4
  • Mean abs. error (cm): 1.14
  • MAE chest (cm): 1.08
  • MAE waist (cm): 2.17
  • MAE hip (cm): 1.58
  • MAE inseam (cm): 0.51
  • MAE shoulder (cm): 0.35
  • Size top-1: 1
  • Holdout error (before): 1
  • Holdout error (after): 0
  • Off-allowlist requests: 0
#2

Claude Opus 4.8 (high)

99
  • Tier Full
  • Agent 92/100
  • Auto 70/70
  • Contract ✓

Outstanding probabilistic UX with labeled confidence, per-metric ranges, size-distribution bars, and explicit flags for low-confidence or borderline inputs. Privacy and opt-in consent are credible and repeated; chest-driven sizing and return-adjustment explanations are clear, with a measured holdout before/after learning demo.

[Screenshots: Desktop · Light, Desktop · Dark, Mobile]

Probe breakdown — automated 70/70
  • Loads & produces well-formed metrics: well-formed { measurements, size, confidence } · 8/8 · passed
  • Real on-device model + runtime loaded: real model (external=true, backend=webgpu) · 6/6 · passed
  • Measurement accuracy vs gold (MAE): meanMAE=1.22cm per-metric={"chest":2.2,"waist":1.57,"hip":1.58,"inseam":0.28,"shoulder":0.45} · 16/16 · passed
  • Recommended size top-1 accuracy: top1=8/8 · 10/10 · passed
  • Recommended size within ±1: within1=8/8 · 6/6 · passed
  • Online learning lowers holdout error (CORE): holdout err 1 -> 0 · 14/14 · passed
  • Graceful no-person / bad-image / non-image: noperson:no_person✓ bad:bad_image✓ notimage:non_image✓ · 4/4 · passed
  • No image egress / on-device only: offAllowlist=0 bigUploadsAfterEstimate=0 · 6/6 · passed

Key metrics

  • Model load (ms): 1598
  • Estimate latency (ms): 103
  • Mean abs. error (cm): 1.22
  • MAE chest (cm): 2.2
  • MAE waist (cm): 1.57
  • MAE hip (cm): 1.58
  • MAE inseam (cm): 0.28
  • MAE shoulder (cm): 0.45
  • Size top-1: 1
  • Holdout error (before): 1
  • Holdout error (after): 0
  • Off-allowlist requests: 0
#3

GPT-5.5

97
  • Tier Full
  • Agent 88/100
  • Auto 70/70
  • Contract ✓

Strong probabilistic UX with a confidence meter, per-metric plausible ranges, and clear chest-driven sizing plus explicit return-adjustment copy. On-device privacy is repeated credibly; gaps are no dedicated low-confidence warning for borderline successes and no explicit delete/revoke control beyond re-upload.

[Screenshots: Desktop · Light, Desktop · Dark, Mobile]

Probe breakdown — automated 70/70
  • Loads & produces well-formed metrics: well-formed { measurements, size, confidence } · 8/8 · passed
  • Real on-device model + runtime loaded: real model (external=false, backend=webgpu) · 6/6 · passed
  • Measurement accuracy vs gold (MAE): meanMAE=1.24cm per-metric={"chest":1.08,"waist":2.17,"hip":1.58,"inseam":0.91,"shoulder":0.45} · 16/16 · passed
  • Recommended size top-1 accuracy: top1=8/8 · 10/10 · passed
  • Recommended size within ±1: within1=8/8 · 6/6 · passed
  • Online learning lowers holdout error (CORE): holdout err 1 -> 0 · 14/14 · passed
  • Graceful no-person / bad-image / non-image: noperson:no_person✓ bad:bad_image✓ notimage:non_image✓ · 4/4 · passed
  • No image egress / on-device only: offAllowlist=0 bigUploadsAfterEstimate=0 · 6/6 · passed

Key metrics

  • Model load (ms): 69
  • Estimate latency (ms): 4
  • Mean abs. error (cm): 1.24
  • MAE chest (cm): 1.08
  • MAE waist (cm): 2.17
  • MAE hip (cm): 1.58
  • MAE inseam (cm): 0.91
  • MAE shoulder (cm): 0.45
  • Size top-1: 1
  • Holdout error (before): 1
  • Holdout error (after): 0
  • Off-allowlist requests: 0
#4

Cursor Composer 2.5

97
  • Tier Full
  • Agent 89/100
  • Auto 70/70
  • Contract ✓

Strong probabilistic UX with a confidence meter, per-metric likely ranges, and a clear chest-driven 'why this size' narrative plus return-adjustment copy. Privacy messaging is repeated and credible; minor gaps are no explicit low-confidence warning state and a pre-checked consent checkbox.

[Screenshots: Desktop · Light, Desktop · Dark, Mobile]

Probe breakdown — automated 70/70
  • Loads & produces well-formed metrics: well-formed { measurements, size, confidence } · 8/8 · passed
  • Real on-device model + runtime loaded: real model (external=true, backend=webgpu) · 6/6 · passed
  • Measurement accuracy vs gold (MAE): meanMAE=1.09cm per-metric={"chest":1.08,"waist":2.17,"hip":1.58,"inseam":0.28,"shoulder":0.35} · 16/16 · passed
  • Recommended size top-1 accuracy: top1=8/8 · 10/10 · passed
  • Recommended size within ±1: within1=8/8 · 6/6 · passed
  • Online learning lowers holdout error (CORE): holdout err 1 -> 0 · 14/14 · passed
  • Graceful no-person / bad-image / non-image: noperson:no_person✓ bad:bad_image✓ notimage:non_image✓ · 4/4 · passed
  • No image egress / on-device only: offAllowlist=0 bigUploadsAfterEstimate=0 · 6/6 · passed

Key metrics

  • Model load (ms): 2549
  • Estimate latency (ms): 105
  • Mean abs. error (cm): 1.09
  • MAE chest (cm): 1.08
  • MAE waist (cm): 2.17
  • MAE hip (cm): 1.58
  • MAE inseam (cm): 0.28
  • MAE shoulder (cm): 0.35
  • Size top-1: 1
  • Holdout error (before): 1
  • Holdout error (after): 0
  • Off-allowlist requests: 0
#5

Grok Build 0.1

54
  • Tier Low
  • Agent 86/100
  • Auto 27/70
  • Contract ✓

The app communicates uncertainty well via a confidence bar, confidence-scaled measurement ranges, and explicit copy to trust ranges over point values. Return-feedback adjustment is explained clearly when applied, on-device privacy is repeated throughout, and the implementation goes beyond the brief with MediaPipe pose detection, silhouette fallback, and accessible UI controls—though low-confidence estimates lack a dedicated warning and only chest is named as the sizing driver.

[Screenshots: Desktop · Light, Desktop · Dark, Mobile]

Probe breakdown — automated 27/70
  • Loads & produces well-formed metrics: well-formed { measurements, size, confidence } · 8/8 · passed
  • Real on-device model + runtime loaded: real model (external=true, backend=webgpu) · 6/6 · passed
  • Measurement accuracy vs gold (MAE): meanMAE=20.99cm per-metric={"chest":38.57,"waist":23.51,"hip":11.55,"inseam":20.65,"shoulder":10.65} · 0/16 · failed
  • Recommended size top-1 accuracy: top1=1/8 · 1/10 · failed
  • Recommended size within ±1: within1=2/8 · 2/6 · failed
  • Online learning lowers holdout error (CORE): holdout err 1 -> 1 · 0/14 · failed
  • Graceful no-person / bad-image / non-image: noperson:no_person✓ bad:bad_image✓ notimage:non_image✓ · 4/4 · passed
  • No image egress / on-device only: offAllowlist=0 bigUploadsAfterEstimate=0 · 6/6 · passed

Key metrics

  • Model load (ms): 2030
  • Estimate latency (ms): 105
  • Mean abs. error (cm): 20.99
  • MAE chest (cm): 38.57
  • MAE waist (cm): 23.51
  • MAE hip (cm): 11.55
  • MAE inseam (cm): 20.65
  • MAE shoulder (cm): 10.65
  • Size top-1: 0.13
  • Holdout error (before): 1
  • Holdout error (after): 1
  • Off-allowlist requests: 0
#6

Claude Sonnet 4.6 (high)

48
  • Tier Low
  • Agent 88/100
  • Auto 20/70
  • Contract ✓

Strong probabilistic UX with a confidence bar, tiered confidence copy, per-metric ranges, and a full size probability chart, plus explicit return-adjustment messaging. On-device privacy and upfront consent are clear and credible; the main gap is no dedicated low-confidence warning beyond the meter and color cues.

[Screenshots: Desktop · Light, Desktop · Dark, Mobile]

Probe breakdown — automated 20/70
  • Loads & produces well-formed metrics: not well-formed ({"ok":true,"value":{"measurements":null,"size":null,"confidence":0,"error":"bad_image"},"ms":5}) · 4/8 · failed
  • Real on-device model + runtime loaded: real model (external=true, backend=webgpu) · 6/6 · passed
  • Measurement accuracy vs gold (MAE): meanMAE=20cm per-metric={"chest":20,"waist":20,"hip":20,"inseam":20,"shoulder":20} · 0/16 · failed
  • Recommended size top-1 accuracy: top1=0/8 · 0/10 · failed
  • Recommended size within ±1: within1=0/8 · 0/6 · failed
  • Online learning lowers holdout error (CORE): holdout err 5 -> 5 · 0/14 · failed
  • Graceful no-person / bad-image / non-image: noperson:bad_image✓ bad:bad_image✓ notimage:bad_image✓ · 4/4 · passed
  • No image egress / on-device only: offAllowlist=0 bigUploadsAfterEstimate=0 · 6/6 · passed

Key metrics

  • Model load (ms): 8345
  • Estimate latency (ms): 1
  • Mean abs. error (cm): 20
  • MAE chest (cm): 20
  • MAE waist (cm): 20
  • MAE hip (cm): 20
  • MAE inseam (cm): 20
  • MAE shoulder (cm): 20
  • Size top-1: 0
  • Holdout error (before): 5
  • Holdout error (after): 5
  • Off-allowlist requests: 0
#7

Kimi K2.5

42
  • Tier Low
  • Agent 58/100
  • Auto 20/70
  • Contract ✓

The app communicates uncertainty well via a color-coded confidence meter and per-measurement ± ranges, with strong upfront on-device privacy messaging. Gaps remain: no explicit low-confidence warnings, drivers are prose-only (alternatives unused), and consent relies on a banner rather than an upfront opt-in gate.

[Screenshots: Desktop · Light, Desktop · Dark, Mobile]

Probe breakdown — automated 20/70
  • Loads & produces well-formed metrics: not well-formed ({"ok":true,"value":{"measurements":null,"size":null,"confidence":0,"error":"bad_image"},"ms":3}) · 4/8 · failed
  • Real on-device model + runtime loaded: real model (external=true, backend=webgpu) · 6/6 · passed
  • Measurement accuracy vs gold (MAE): meanMAE=20cm per-metric={"chest":20,"waist":20,"hip":20,"inseam":20,"shoulder":20} · 0/16 · failed
  • Recommended size top-1 accuracy: top1=0/8 · 0/10 · failed
  • Recommended size within ±1: within1=0/8 · 0/6 · failed
  • Online learning lowers holdout error (CORE): holdout err 5 -> 5 · 0/14 · failed
  • Graceful no-person / bad-image / non-image: noperson:bad_image✓ bad:bad_image✓ notimage:bad_image✓ · 4/4 · passed
  • No image egress / on-device only: offAllowlist=0 bigUploadsAfterEstimate=0 · 6/6 · passed

Key metrics

  • Model load (ms): 7175
  • Estimate latency (ms): 3
  • Mean abs. error (cm): 20
  • MAE chest (cm): 20
  • MAE waist (cm): 20
  • MAE hip (cm): 20
  • MAE inseam (cm): 20
  • MAE shoulder (cm): 20
  • Size top-1: 0
  • Holdout error (before): 5
  • Holdout error (after): 5
  • Off-allowlist requests: 0
  • Console errors: 18
#8

Gemini 3.1 Pro

42
  • Tier Low
  • Agent 74/100
  • Auto 20/70
  • Contract ✓

Uncertainty is communicated well through a confidence bar, low-confidence warning, and confidence-scaled measurement ranges, with strong repeated on-device privacy copy and consent gating. Gaps include a single point size recommendation, generic return-feedback messaging that does not explain direction or magnitude, and only the primary chest metric is named as the sizing driver.

[Screenshots: Desktop · Light, Desktop · Dark, Mobile]

Probe breakdown — automated 20/70
  • Loads & produces well-formed metrics: not well-formed ({"ok":true,"value":{"measurements":null,"size":null,"confidence":0,"error":"non_image"},"ms":8}) · 4/8 · failed
  • Real on-device model + runtime loaded: real model (external=true, backend=webgpu) · 6/6 · passed
  • Measurement accuracy vs gold (MAE): meanMAE=20cm per-metric={"chest":20,"waist":20,"hip":20,"inseam":20,"shoulder":20} · 0/16 · failed
  • Recommended size top-1 accuracy: top1=0/8 · 0/10 · failed
  • Recommended size within ±1: within1=0/8 · 0/6 · failed
  • Online learning lowers holdout error (CORE): holdout err 5 -> 5 · 0/14 · failed
  • Graceful no-person / bad-image / non-image: noperson:non_image✓ bad:non_image✓ notimage:non_image✓ · 4/4 · passed
  • No image egress / on-device only: offAllowlist=0 bigUploadsAfterEstimate=0 · 6/6 · passed

Key metrics

  • Model load (ms): 20070
  • Estimate latency (ms): 4
  • Mean abs. error (cm): 20
  • MAE chest (cm): 20
  • MAE waist (cm): 20
  • MAE hip (cm): 20
  • MAE inseam (cm): 20
  • MAE shoulder (cm): 20
  • Size top-1: 0
  • Holdout error (before): 5
  • Holdout error (after): 5
  • Off-allowlist requests: 0
  • Console errors: 19