Model comparison

All evaluated models side by side: capability profile, scores, and the generated product page.

Capability profile (axes: Functional, Visual, UX, Engineering, AI Quality) for:
  • Claude Opus 4.8 (high)
  • GPT-5.5
  • Grok Build 0.1
  • Cursor Composer 2.5
  • GLM 5.2
  • Kimi K2.5
  • Gemini 3.1 Pro

Metrics

Dimension        Claude Opus 4.8 (high)  GPT-5.5  Grok Build 0.1  Cursor Composer 2.5  GLM 5.2  Kimi K2.5  Gemini 3.1 Pro
Total            113.5                   110.2    109.1           105.5                96.1     87.8       36
Agent score      84                      78       78              71                   66       65         10
Functional       20/20                   19.6/20  19/20           20/20                20/20    20/20      12.6/20
Visual Design    20/20                   19/20    19/20           18/20                15/20    14/20      4/20
UX               18/20                   18/20    18/20           18/20                15/20    15/20      4/20
Engineering      18.5/20                 19.4/20  17.1/20         18.5/20              16.1/20  14.8/20    11.2/20
AI Quality       20/20                   20/20    20/20           18/20                16/20    15/20      5/20
Lighthouse Perf  79                      95       86              86                   91       83         97
Lighthouse SEO   100                     100      100             100                  100      100        100
axe violations   2                       1        1               2                    3        4          1
Run time         1503.7s                 428.8s   447.1s          152.2s               903.6s   903.8s     358.3s
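
For readers who want to work with these figures, here is a minimal sketch that sums the five rubric dimensions (Functional, Visual Design, UX, Engineering, AI Quality, each out of 20) into a subtotal out of 100 and ranks the models by it. The names `RUBRIC`, `rubric_subtotal`, and `rank_by_subtotal` are hypothetical, introduced only for illustration. The subtotal is an assumption of this sketch and is not the reported Total, because the page does not state how Total is derived.

```python
# Rubric scores (each out of 20) copied from the Metrics table, in the order:
# Functional, Visual Design, UX, Engineering, AI Quality.
RUBRIC = {
    "Claude Opus 4.8 (high)": [20, 20, 18, 18.5, 20],
    "GPT-5.5": [19.6, 19, 18, 19.4, 20],
    "Grok Build 0.1": [19, 19, 18, 17.1, 20],
    "Cursor Composer 2.5": [20, 18, 18, 18.5, 18],
    "GLM 5.2": [20, 15, 15, 16.1, 16],
    "Kimi K2.5": [20, 14, 15, 14.8, 15],
    "Gemini 3.1 Pro": [12.6, 4, 4, 11.2, 5],
}


def rubric_subtotal(scores):
    """Sum the five rubric scores (hypothetical helper, out of 100).

    This is NOT the page's Total; the Total formula is not given in the source.
    """
    return round(sum(scores), 1)


def rank_by_subtotal(rubric):
    """Return (model, subtotal) pairs sorted from highest to lowest subtotal."""
    pairs = [(name, rubric_subtotal(scores)) for name, scores in rubric.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, subtotal in rank_by_subtotal(RUBRIC):
        print(f"{name:<24}{subtotal:>6.1f} / 100")
```

Under this assumption the ranking by subtotal comes out in the same order as the reported Total, although the values differ (for example, Gemini 3.1 Pro has a subtotal of 36.8 against a Total of 36).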

Results side by side

Generated product page per model, with total score:
  • Claude Opus 4.8 (high): 113.5
  • GPT-5.5: 110.2
  • Grok Build 0.1: 109.1
  • Cursor Composer 2.5: 105.5
  • GLM 5.2: 96.1
  • Kimi K2.5: 87.8
  • Gemini 3.1 Pro: 36