ScaffBench: measuring coding agents on real fullstack scaffolding
102 runs across Claude Code, Codex CLI, Gemini CLI, Kilo, and opencode, measuring speed, cost, tokens, and whether generated projects install and build.
June 12, 2026 · Ibrahim Elkamali
Coding agents are very good at writing code and surprisingly bad at starting projects. Ask one to scaffold a production-grade fullstack monorepo and it will happily spend ten minutes hand-writing manifests, lockfiles, and config — and the result frequently doesn't build.
ScaffBench measures exactly that: an agent in an empty workspace, a project spec, and a hard question at the end — does the generated project install and build? We ran the same specs through three creation paths to isolate how much tooling helps:
- MCP: the agent uses the Better-Fullstack MCP server to plan and scaffold the project.
- BF mention: no MCP; the prompt points the agent at the Better-Fullstack CLI and docs, and it composes the `create` command itself.
- Prompt: no Better-Fullstack at all; the agent hand-writes every file from scratch.
We ran the full suite twice, on two different agent CLIs: OpenAI Codex CLI with three GPT models (June 10), and Claude Code with three Claude models (June 12) — 72 runs in total. Same specs, same prompts, same validation. We then added a deliberately light third sweep — nine more models across Gemini CLI, Kilo, and opencode, on the lightest spec only — bringing the total to 102 runs and fifteen models.
Headline results
The headline numbers come from the Claude Code sweep — 36 runs: three models (Fable 5, Opus 4.8, Sonnet 4.6) × three creation paths × four project specs. Same machine, same harness, same prompts apart from the creation-mode instructions. The GPT sweep shows the same structure and is reported separately below.
| Creation path | Avg time | Median time | Avg output tokens | Builds passing |
|---|---|---|---|---|
| MCP | 113.4s | 85.9s | 5,553 | 12/12 (100%) |
| BF mention | 219.6s | 172.1s | 11,060 | 9/9 (100%)* |
| Prompt | 516.2s | 424.0s | 25,859 | 9/12 (75%) |
* Three BF-mention runs are excluded from the denominator because they failed on a template generator bug on our side, not on anything the agent did — see the scoring policy.
Three things stood out:
- MCP is ~4.6× faster than prompt-only on average (and ~4.9× on medians), with ~4.7× fewer output tokens. The agent ships configuration, not code.
- Every MCP run passed validation, for every model. Prompt-only runs passed 75% — two runs hit the 15-minute timeout and one produced a monorepo that didn't build.
- Tooling compresses model differences. On the MCP path even the smallest model matched the frontier models at 100% pass rate — it just got there faster and cheaper.
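The ratios in the first bullet follow directly from the headline table. A minimal check in Python, using only the averages and medians printed there (the dictionary layout and names are just for illustration):

```python
# Headline figures copied from the table above (Claude Code sweep).
headline = {
    "MCP":    {"avg_s": 113.4, "median_s": 85.9,  "avg_out_tokens": 5553},
    "Prompt": {"avg_s": 516.2, "median_s": 424.0, "avg_out_tokens": 25859},
}


def ratio(metric: str) -> float:
    """How many times larger the prompt-only figure is than the MCP figure."""
    return headline["Prompt"][metric] / headline["MCP"][metric]


speedup_avg = ratio("avg_s")
speedup_median = ratio("median_s")
token_ratio = ratio("avg_out_tokens")

print(f"avg time:    {speedup_avg:.1f}x")
print(f"median time: {speedup_median:.1f}x")
print(f"out tokens:  {token_ratio:.1f}x")
```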
Methodology
The three creation paths
Every run starts in an empty directory with the same base prompt:
You are running in an empty benchmark workspace:
{run_dir}
Create exactly one project directory named `{project_name}`.
Do not ask questions. Do not start a dev server. Do not write outside the
current working directory.
At the end, report the commands you ran and any errors you hit.
Benchmark target: {spec_title}
Requirements:
{spec_requirements}

The only difference between paths is the creation-mode instruction appended to that prompt. Prompt-only runs are explicitly forbidden from touching anything of ours:
Creation mode: prompt-only.
Do not use the Better-Fullstack MCP server, Better-Fullstack CLI,
Better-Fullstack website, or files from the Better-Fullstack repository.
Create the project from scratch by writing the files and manifests needed
for a runnable starter.

MCP runs get the opposite:
Creation mode: Better-Fullstack MCP.
Use the Better-Fullstack MCP tools, starting with bfs_get_guidance. Then use
schema/compatibility/plan as needed and call bfs_create_project to create
the project.

BF-mention runs sit in between: no MCP tools, but the prompt names the CLI and a README the agent may read before composing a non-interactive create command itself.
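To make the "same base prompt, different suffix" design concrete, here is a minimal sketch of how the prompts quoted above could be assembled. The function name `build_prompt`, the mode keys, the blank-line separator, and the example values are assumptions for illustration; the prompt text itself is copied from the post. The BF-mention instruction is not quoted in full above, so it is left out rather than guessed.

```python
# Base prompt, as quoted in the post.
BASE_PROMPT = """\
You are running in an empty benchmark workspace:
{run_dir}
Create exactly one project directory named `{project_name}`.
Do not ask questions. Do not start a dev server. Do not write outside the
current working directory.
At the end, report the commands you ran and any errors you hit.
Benchmark target: {spec_title}
Requirements:
{spec_requirements}"""

# Creation-mode instructions, as quoted in the post (BF mention omitted).
MODE_INSTRUCTIONS = {
    "prompt": (
        "Creation mode: prompt-only.\n"
        "Do not use the Better-Fullstack MCP server, Better-Fullstack CLI,\n"
        "Better-Fullstack website, or files from the Better-Fullstack repository.\n"
        "Create the project from scratch by writing the files and manifests needed\n"
        "for a runnable starter."
    ),
    "mcp": (
        "Creation mode: Better-Fullstack MCP.\n"
        "Use the Better-Fullstack MCP tools, starting with bfs_get_guidance. Then use\n"
        "schema/compatibility/plan as needed and call bfs_create_project to create\n"
        "the project."
    ),
}


def build_prompt(mode: str, *, run_dir: str, project_name: str,
                 spec_title: str, spec_requirements: str) -> str:
    """Fill the shared base prompt, then append the mode-specific instruction."""
    base = BASE_PROMPT.format(
        run_dir=run_dir,
        project_name=project_name,
        spec_title=spec_title,
        spec_requirements=spec_requirements,
    )
    # Assumption: the instruction is appended after a blank line.
    return f"{base}\n\n{MODE_INSTRUCTIONS[mode]}"


# Hypothetical example values.
example = build_prompt(
    "mcp",
    run_dir="/tmp/run-001",
    project_name="demo-app",
    spec_title="light-ts",
    spec_requirements="React + Vite, Hono on Bun, tRPC",
)
```

Because everything before the `Creation mode:` line is identical across paths, any difference in outcome can be attributed to the creation path rather than to prompt wording.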
The four project specs
The specs are deliberately spread from "weekend project" to "the stack a real team would argue about for a week":
| Spec | What it asks for |
|---|---|
| `light-ts` | React + Vite, Hono on Bun, tRPC, SQLite + Drizzle, Tailwind + DaisyUI, Pino, Vitest, Biome; no auth, no payments |
| `heavy-ts` | Next.js and an Expo/React Native app, Hono on Bun, oRPC, PostgreSQL + Drizzle, Better Auth, Stripe, Resend, UploadThing, Effect, TanStack Store/Form/Query, Valibot, Vitest + Playwright, Vercel AI SDK, Socket.IO, Inngest, Framer Motion, OpenTelemetry, PostHog, Umami, Sanity, Upstash Redis, Algolia, S3, shadcn/ui, Turborepo, MSW, Storybook, PWA, Docker compose |
| `python-ai` | FastAPI, SQLModel, PostgreSQL, Pydantic, JWT auth, Celery, Strawberry GraphQL, LangChain + OpenAI SDK + LangGraph + CrewAI, Ruff |
| `multi-ecosystem` | TypeScript Next.js frontend, Python FastAPI backend (SQLModel/Pydantic/LangChain/Celery), Go Gin service (GORM, gRPC, Cobra, Zap), shared PostgreSQL |
`heavy-ts` is the stress test. It is also where almost every failure in the two main sweeps happened.
The harness
Two sweeps, one harness design:
- Claude sweep: Claude Code in headless mode (`claude -p`), JSON event stream captured per run. Models: `claude-fable-5`, `claude-opus-4-8`, `claude-sonnet-4-6`, default reasoning effort.
- GPT sweep: OpenAI Codex CLI (`codex exec`), JSON event stream captured per run. Models: `gpt-5.3-codex-spark` (high effort), `gpt-5.4` (medium), `gpt-5.5` (medium).
- Isolation: every run gets a fresh empty workspace; MCP runs get a strict MCP config with only the Better-Fullstack server attached.
- Timeout: 900 seconds per run. A timeout counts as a failure for the agent.
- Metrics: wall-clock time, token usage from the agent's own usage report, MCP tool calls from the event stream. Dollar cost is captured on the Claude sweep only — our Codex harness doesn't report metered cost.
One important asymmetry: the GPT sweep ran two days earlier, against the generator before we fixed the template bugs it exposed. That's why the generator-bug exclusions below hit the GPT results harder — and why the two sweeps shouldn't be compared head-to-head on pass rates.
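The isolation and timeout rules can be sketched in a few lines of Python. This is not the actual harness: `run_agent` and `RunResult` are hypothetical names, and the agent command is passed in by the caller because the post does not show the full CLI arguments beyond `claude -p` and `codex exec`.

```python
import subprocess
import tempfile
import time
from dataclasses import dataclass
from typing import Optional

TIMEOUT_S = 900  # per-run budget stated in the post


@dataclass
class RunResult:
    workspace: str            # fresh directory the agent ran in, kept for validation
    seconds: float            # wall-clock time
    timed_out: bool           # True counts as a failure for the agent
    exit_code: Optional[int]  # None when the run was killed at the timeout


def run_agent(command: list[str], timeout_s: float = TIMEOUT_S) -> RunResult:
    """Run one agent command in a fresh empty workspace under a wall-clock budget."""
    workspace = tempfile.mkdtemp(prefix="scaffbench-")
    start = time.monotonic()
    try:
        proc = subprocess.run(
            command, cwd=workspace, capture_output=True, timeout=timeout_s
        )
        return RunResult(workspace, time.monotonic() - start, False, proc.returncode)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so the workspace holds
        # whatever the agent managed to write before the cutoff.
        return RunResult(workspace, time.monotonic() - start, True, None)
```

Keeping the workspace after a timeout matters for cases like the Laguna m.1 prompt-only run in the light sweep, where the project written before the cutoff was still validated.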
What counts as passing
Validation runs after the agent finishes, against whatever it left on disk:
- TypeScript: `bun install`, then the project's own `build` (or `check`/`lint`/`test`) script.
- Python: `compileall` across the project, excluding virtualenvs and vendored packages.
- Go: `go mod download` and the module must compile.
A run passes only if every applicable check exits zero. "It looks like a project" doesn't count; it has to build.
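The "every applicable check exits zero" rule might look like the sketch below. The command lists are assumptions modelled on the checks named above: the exclusion regex, the use of `bun run build` for the project's own script, and `go build ./...` for "the module must compile" are illustrative choices, not the harness's actual configuration.

```python
import subprocess
import sys
from pathlib import Path

# Assumed regex for "virtualenvs and vendored packages"; compileall's -x flag
# skips any file whose path matches it.
EXCLUDE = r"[/\\](\.venv|venv|node_modules|site-packages)[/\\]"

PYTHON_CHECKS = [
    [sys.executable, "-m", "compileall", "-q", "-x", EXCLUDE, "."],
]
TYPESCRIPT_CHECKS = [
    ["bun", "install"],
    ["bun", "run", "build"],  # assumption: the project defines a `build` script
]
GO_CHECKS = [
    ["go", "mod", "download"],
    ["go", "build", "./..."],
]


def run_checks(project_dir: Path, checks: list[list[str]]) -> bool:
    """Return True only if every check command exits with status zero."""
    for command in checks:
        result = subprocess.run(command, cwd=project_dir, capture_output=True)
        if result.returncode != 0:
            return False
    return True
```

A multi-ecosystem project would run all three lists against the relevant subdirectories and pass only if each returns `True`.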
What counts as a failure
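The scoring policy: a run that fails because of a bug in our own generator templates, not because of anything the agent did, is excluded from the pass-rate denominator. Failures the agent authored count in full.
On the Claude sweep the exclusion applied exactly three times: all three models, on the BF-mention path, scaffolded `heavy-ts` correctly via the CLI, and all three hit the same generator bug in our native (Expo) template, where `app.json` sets a web output mode that only works with `expo-router`. The agents did everything right; the template couldn't build. Those three runs are excluded.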
Prompt-path failures (two timeouts, one broken hand-written build) are entirely agent-authored and
count in full.
On the GPT sweep — which ran against the pre-fix generator — it mattered seven times: heavy-ts
through MCP and through the CLI failed for all three models on the since-fixed Storybook/Expo bug
chain, plus one multi-ecosystem scaffold whose Go service shipped without a generated go.sum.
All seven are excluded for the same reason; the GPT agents' own failures (three broken
hand-written builds on the prompt path) count in full.
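In arithmetic terms, the exclusion rule removes generator-bug failures from the denominator and leaves every other failure in. A small sketch, with outcome labels that are hypothetical names chosen for this example:

```python
from collections import Counter


def pass_rate(results: list[str]) -> tuple[int, int]:
    """Return (passed, denominator), dropping generator-bug runs from the denominator."""
    counts = Counter(results)
    passed = counts["pass"]
    denominator = len(results) - counts["generator_bug"]
    return passed, denominator


# Claude sweep, per the post: 12 runs on each path.
claude_bf_mention = ["pass"] * 9 + ["generator_bug"] * 3
claude_prompt = ["pass"] * 9 + ["timeout"] * 2 + ["build_failed"]

print(pass_rate(claude_bf_mention))  # (9, 9)
print(pass_rate(claude_prompt))      # (9, 12)
```

Timeouts and broken hand-written builds stay in the denominator, which is why the prompt path reads 9/12 while BF mention reads 9/9.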
Results
Per model and path
Times and tokens are averages across the four specs; cost is the metered API total for those four runs.
| Model | Path | Avg time | Avg output tokens | Cost (4 specs) | Builds passing |
|---|---|---|---|---|---|
| Fable 5 | MCP | 172.6s | 7,590 | $9.34 | 4/4 |
| Fable 5 | BF mention | 405.7s | 17,748 | $15.66 | 3/3* |
| Fable 5 | Prompt | 572.8s | 24,905 | $11.26† | 3/4 |
| Opus 4.8 | MCP | 97.1s | 5,206 | $3.43 | 4/4 |
| Opus 4.8 | BF mention | 154.7s | 10,596 | $4.12 | 3/3* |
| Opus 4.8 | Prompt | 510.8s | 21,485 | $6.89† | 3/4 |
| Sonnet 4.6 | MCP | 70.3s | 3,863 | $1.47 | 4/4 |
| Sonnet 4.6 | BF mention | 98.3s | 4,834 | $1.31 | 3/3* |
| Sonnet 4.6 | Prompt | 464.9s | 31,188 | $7.11 | 3/4 |
* heavy-ts excluded (generator bug on our side). † Timed-out runs report no cost, so these
totals under-count the true spend.
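For readers reconciling this table with the appendix: the averages appear to divide by all four specs, with a timed-out run contributing its full 900 seconds and no output tokens. That is an inference from the published figures, not a documented rule. The Fable 5 prompt-path row can be reproduced like this:

```python
# Fable 5, prompt path, copied from the appendix:
# spec -> (seconds, output tokens or None for the timed-out run)
fable_prompt_runs = {
    "heavy-ts": (900, None),
    "light-ts": (413, 32080),
    "multi-ecosystem": (570, 37436),
    "python-ai": (408, 30103),
}

n = len(fable_prompt_runs)
avg_time = sum(seconds for seconds, _ in fable_prompt_runs.values()) / n
# Assumption: a run with no reported tokens counts as zero in the average.
avg_tokens = sum(tokens or 0 for _, tokens in fable_prompt_runs.values()) / n

print(f"{avg_time:.2f}s, {avg_tokens:.2f} tokens")
```

The result lands within rounding of the 572.8s and 24,905 tokens in the table (appendix times are rounded to whole seconds). If the inference holds, prompt-path token averages for models with a timed-out run understate the work done per completed run.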
Cost
The Claude sweep cost $60.60 in metered API usage (less than the true number — the two timed-out runs report $0; the Codex harness doesn't report cost at all). Reading the spread is more interesting than the total:
- Cheapest passing run: $0.12, Sonnet 4.6 scaffolding `light-ts` through the CLI in 21 seconds.
- Most expensive passing runs: $3.50–$4.21, Fable 5 hand-writing projects on the prompt path.
- For Sonnet, the prompt path cost ~5× its MCP path ($7.11 vs $1.47) and was ~6.6× slower, for a lower pass rate.
The pattern holds across all three models: the more of the project the agent has to author itself, the more you pay for a less reliable result.
GPT models on Codex CLI
The GPT sweep ran the identical specs and prompts through OpenAI's Codex CLI two days before the
Claude sweep: gpt-5.3-codex-spark at high reasoning effort, gpt-5.4 and gpt-5.5 at medium.
Generator-bug failures are excluded exactly as above (seven runs here — this sweep predates the
template fixes).
| Model | Path | Avg time | Avg output tokens | Builds passing |
|---|---|---|---|---|
| GPT-5.3 Codex Spark | MCP | 32.4s | 5,798 | 3/3* |
| GPT-5.3 Codex Spark | BF mention | 65.6s | 9,894 | 3/3* |
| GPT-5.3 Codex Spark | Prompt | 44.8s | 31,418 | 2/4 |
| GPT-5.4 | MCP | 92.0s | 5,172 | 3/3* |
| GPT-5.4 | BF mention | 156.0s | 7,084 | 3/3* |
| GPT-5.4 | Prompt | 203.1s | 13,328 | 3/4 |
| GPT-5.5 | MCP | 76.5s | 3,838 | 2/2* |
| GPT-5.5 | BF mention | 74.1s | 4,548 | 3/3* |
| GPT-5.5 | Prompt | 264.2s | 15,650 | 4/4 |
* heavy-ts excluded on MCP and BF mention for all three models, plus one GPT-5.5 multi-ecosystem
MCP run — all on since-fixed generator bugs (this sweep ran before the fixes shipped).
What the GPT sweep adds to the picture:
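- The structure is vendor-independent. For every GPT model, MCP is faster than prompt-only and prompt-only is the most expensive path in output tokens, the same shape as the Claude results on a different vendor's models and a different agent CLI. (MCP was the fastest path outright for two of the three models; GPT-5.5's BF-mention average edged it out, 74.1s vs 76.5s.)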
- Spark is built for speed, and it shows. GPT-5.3 Codex Spark averaged 32 seconds per MCP scaffold, the fastest per-path average in either main sweep. On the prompt path that speed came at the price of reliability: it hand-wrote projects in 45 seconds on average and only half of them built.
- GPT-5.5 is the prompt-path outlier. It was the only model across both sweeps to pass every prompt-only spec, including `heavy-ts`, at 542 seconds and 28k output tokens for that one run, versus 108 seconds through MCP.
- GPT models lean harder on the tools. Codex runs made 10–34 MCP calls per scaffold versus Claude's 3–10, re-checking compatibility and re-planning more often, with the same end result.
A caveat worth repeating: the two sweeps ran on different agent harnesses, different days, and different generator versions. Comparing paths within a sweep is the experiment; comparing vendors across sweeps is not.
The light sweep: Gemini, Kilo, and opencode
The full suite costs real money and real rate-limit budget, so for the next batch of models we ran
a deliberately light version: one spec (light-ts) × three creation paths × one run per cell.
Nine models across three more agent CLIs, all on June 12:
- Gemini CLI — Gemini 3.1 Pro (Google's latest, and the CLI's default).
- opencode (Go subscription) — Kimi K2.6, GLM-5.1, MiniMax M3, Qwen3.7 Max, DeepSeek-V4 Pro: the current open-weights flagship tier.
- Kilo CLI (free model tier) — Step-3.7 Flash, Laguna m.1, Nex N2-Pro.
Same prompts, same 900-second budget, same validation as the main sweeps. Scoring matches the headline criterion — does the project install and build — so lint-only failures don't count against a run (the strict logs are in the run artifacts). Gemini got a second, accidental repeat of all three paths, which we kept as a free variance sample: 30 runs total. (A tenth model, Nemotron-3 Ultra on Kilo's free tier, is excluded — the endpoint was too slow to drive an agent at all; see the appendix note.)
| Model | MCP | BF mention | Prompt-only |
|---|---|---|---|
| Gemini 3.1 Pro | ✅ 60s · 1.6k | ✅ 27s · 0.9k | ✅ 124s · 13.2k |
| Kimi K2.6 | ✅ 311s · 2.8k | ✅ 24s · 1.6k | ✅ 372s · 13.5k |
| GLM-5.1 | ✅ 46s · 1.2k | ✅ 67s · 2.2k | ✅ 694s · 29.5k |
| MiniMax M3 | ✅ 69s · 2.5k | ✅ 32s · 0.8k | ❌ build break |
| Qwen3.7 Max | ✅ 80s · 2.7k | ✅ 35s · 0.8k | ✅ 379s · 13.3k |
| DeepSeek-V4 Pro | ✅ 43s · 1.4k | ✅ 40s · 1.4k | ✅ 357s · 11.7k |
| Step-3.7 Flash | ✅ 45s · 2.0k | ✅ 16s · 0.6k | ❌ broken install |
| Laguna m.1 | ✅ 225s · 1.6k | ✅ 592s · 3.7k | ✅ 900s (timeout) · 9.4k |
| Nex N2-Pro | ✅ 227s · 2.4k | ✅ 146s · 1.3k | ❌ build break |
What the light sweep adds:
- The assisted paths went 18-for-18. Every model passed both MCP and BF mention. That includes Step-3.7 Flash — a small, fast, free model — scaffolding a correct fullstack monorepo in 16 seconds and 640 output tokens via the BF mention path. The structure we saw on Claude and GPT isn't a frontier-model phenomenon; the tooling levels the playing field all the way down.
- Prompt-only is where the field splits. Three of nine models shipped broken projects, each in an instructive way: Step-3.7 Flash pinned `pino-http` to a version that doesn't exist (install failed); MiniMax M3 named a tRPC procedure `useContext`, colliding with tRPC's built-in client method (type check failed); Nex N2-Pro wrote an import that doesn't resolve (Rollup failed). None of these failure modes exist on the assisted paths, because the agent never hand-writes a manifest.
- The economics hold on a third, fourth, and fifth harness. Every assisted run finished under 4k output tokens (all but Laguna m.1's 3.7k BF-mention run came in under 3k). Every prompt-only run cost 8–36k, with MiniMax M3's 36k being the most expensive single run in the light sweep, for a project that didn't build.
- Gemini 3.1 Pro is excellent at this task. One of the fastest BF mention cells in the light sweep (26.7 seconds, 863 tokens), a textbook four-call MCP flow, and the fastest prompt-only pass in the light sweep (123.6 seconds). Its repeat sample is also a useful honesty check: the first prompt run failed Biome lint, the second passed cleanly. Single-run cells are directional, not definitive.
Light-sweep caveats: one run per cell, and 25 of the runs executed concurrently, so wall-clock times carry contention noise (Gemini's BF mention run measured 27s solo and 50s in the parallel batch — same outcome, different traffic). Treat path structure and pass/fail as the signal, exact seconds as approximate.
Qualitative analysis
MCP keeps agents on rails
MCP runs converged on the same tool sequence: `bfs_get_guidance` → compatibility check → plan → `bfs_create_project`. Claude runs took between 3 and 10 calls (the multi-ecosystem spec needed the most iterations to settle a valid stack). Output stays small because the agent's job collapses to choosing a configuration; the file-writing is done by the generator, deterministically. That's also why pass rates don't degrade as the spec grows: on the Claude sweep, `heavy-ts` through MCP passed for every model.
Prompt-only agents drown in the heavy spec
On the prompt path, heavy-ts defeated almost everyone. Fable 5 and Opus 4.8 were still writing
files at the 15-minute timeout. Sonnet 4.6 finished a 90-file monorepo in 767 seconds — and it
didn't build. GPT-5.3 Codex Spark sprinted to a hand-written project in 68 seconds — it didn't
build either. The single exception across both sweeps was GPT-5.5, which ground through heavy-ts
prompt-only in 542 seconds and passed. The lighter specs mostly passed prompt-only, but typically at minutes of wall-clock and roughly 8k–37k output tokens each. Hand-writing a starter is something frontier models can do; it's just the slowest, most expensive, least reliable way to get one.
The benchmark audited our own templates
The most useful failures were ours. Validating every generated project surfaced six layered
template bugs in our generator, most hiding behind a single "Storybook build fails" symptom on
heavy-ts. The GPT sweep hit them first — eight of its nine heavy-ts builds failed, six of
them on this generator chain — and the Claude sweep two days later confirmed which fixes held:
- Storybook templates tested the frontend with an equality check, but `frontend` is an array.
- A database package path mismatch left the DB package without its dependencies in graph-part mode.
- `expo-network` was missing from a native template variant.
- Storybook 8 framework packages don't re-export `Meta`/`StoryObj` types; imports had to move to the renderer packages.
- `@better-auth/expo` needed four Expo peer dependencies installed explicitly.
- Native `app.json` sets a static web output mode that only works with `expo-router`. This one is still open, and is the bug behind the three excluded BF-mention runs.
Five of the six were fixed and shipped in create-better-fullstack 2.0.2 before this post went
out. Running your own product through an agent benchmark turns out to be a brutally effective QA
pass.
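The first bug in that list is a common pattern worth spelling out. The generator's templates are not shown in this post, so the following is only an illustration in Python of the same mistake; the config shape and the values `"next"` and `"native"` are hypothetical.

```python
# Hypothetical config: `frontend` is a list because a project can have several.
config = {"frontend": ["next", "native"]}


def has_frontend_buggy(cfg: dict, name: str) -> bool:
    # Bug pattern: comparing the whole list to a single string never matches.
    return cfg["frontend"] == name


def has_frontend_fixed(cfg: dict, name: str) -> bool:
    # Fix pattern: test membership in the list instead.
    return name in cfg["frontend"]


print(has_frontend_buggy(config, "next"))  # False
print(has_frontend_fixed(config, "next"))  # True
```

A check like the buggy one fails silently: the conditional branch is simply never taken, which is consistent with the symptom only showing up later as a build failure.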
Model character
Sonnet 4.6 was the fastest on every path and the cheapest on both assisted paths, and with tooling it gave up nothing in reliability. Fable 5 was the most deliberate, with the highest token counts on the assisted paths and the longest runs on every path, and on the prompt path that thoroughness still wasn't enough to beat the timeout on `heavy-ts`. Opus 4.8 sat reliably in between. The speed ranking never changed across paths; the gap did. Tooling is the great equalizer: on MCP, the spread between best and worst model was 102 seconds on average; prompt-only, individual runs differed by minutes and two hit the wall-clock ceiling.
Limitations
- One run per cell. 102 runs is enough to see the structure, not to put confidence intervals on it. Scaffold times also vary with API load — and the light sweep's parallel batch adds contention noise on top (Gemini's own repeat showed both a 2× time spread and a lint-fail/pass flip between identical runs).
- The sweeps aren't head-to-head. The GPT runs used a different agent CLI (Codex), different reasoning-effort settings, an earlier generator version, and report no dollar cost; the light sweep covers one spec on three further CLIs. Within-sweep path comparisons are sound; cross-vendor model rankings are not what this benchmark measures.
- We benchmarked our own tool. The harness, prompts, and validation are public in spirit — prompts and policy are quoted above — and the full per-run table is in the appendix. Treat the comparison between paths (tooling vs no tooling) as the finding, not a claim about other scaffolders.
- Builds, not features. Validation proves the project installs and builds — not that the auth flow works or the Stripe webhook is wired correctly. A passing prompt-only project may still be missing more of the spec than a generated one.
- Timeout cost under-reporting. Runs killed at 900s report $0 cost, which flatters the prompt path's totals.
Appendix: all 102 runs
Claude sweep (Claude Code, June 12)
| Model | Path | Spec | Time | Output tokens | Cost | Result |
|---|---|---|---|---|---|---|
| Fable 5 | MCP | heavy-ts | 290s | 11,891 | $2.97 | pass |
| Fable 5 | MCP | light-ts | 82s | 4,166 | $1.62 | pass |
| Fable 5 | MCP | multi-ecosystem | 240s | 10,895 | $3.22 | pass |
| Fable 5 | MCP | python-ai | 78s | 3,406 | $1.54 | pass |
| Fable 5 | BF mention | heavy-ts | 613s | 30,259 | $6.98 | build failed* |
| Fable 5 | BF mention | light-ts | 314s | 16,132 | $3.74 | pass |
| Fable 5 | BF mention | multi-ecosystem | 470s | 15,196 | $2.98 | pass |
| Fable 5 | BF mention | python-ai | 226s | 9,404 | $1.95 | pass |
| Fable 5 | Prompt | heavy-ts | 900s | — | — | timed out |
| Fable 5 | Prompt | light-ts | 413s | 32,080 | $3.50 | pass |
| Fable 5 | Prompt | multi-ecosystem | 570s | 37,436 | $4.21 | pass |
| Fable 5 | Prompt | python-ai | 408s | 30,103 | $3.55 | pass |
| Opus 4.8 | MCP | heavy-ts | 178s | 5,805 | $0.88 | pass |
| Opus 4.8 | MCP | light-ts | 46s | 2,894 | $0.72 | pass |
| Opus 4.8 | MCP | multi-ecosystem | 118s | 8,903 | $1.10 | pass |
| Opus 4.8 | MCP | python-ai | 47s | 3,223 | $0.74 | pass |
| Opus 4.8 | BF mention | heavy-ts | 345s | 24,383 | $2.29 | build failed* |
| Opus 4.8 | BF mention | light-ts | 39s | 2,754 | $0.32 | pass |
| Opus 4.8 | BF mention | multi-ecosystem | 116s | 8,038 | $0.70 | pass |
| Opus 4.8 | BF mention | python-ai | 118s | 7,211 | $0.81 | pass |
| Opus 4.8 | Prompt | heavy-ts | 900s | — | — | timed out |
| Opus 4.8 | Prompt | light-ts | 435s | 33,390 | $2.50 | pass |
| Opus 4.8 | Prompt | multi-ecosystem | 395s | 28,598 | $2.43 | pass |
| Opus 4.8 | Prompt | python-ai | 313s | 23,952 | $1.96 | pass |
| Sonnet 4.6 | MCP | heavy-ts | 90s | 5,094 | $0.41 | pass |
| Sonnet 4.6 | MCP | light-ts | 47s | 2,530 | $0.33 | pass |
| Sonnet 4.6 | MCP | multi-ecosystem | 101s | 5,904 | $0.42 | pass |
| Sonnet 4.6 | MCP | python-ai | 43s | 1,923 | $0.31 | pass |
| Sonnet 4.6 | BF mention | heavy-ts | 98s | 5,527 | $0.28 | build failed* |
| Sonnet 4.6 | BF mention | light-ts | 21s | 795 | $0.12 | pass |
| Sonnet 4.6 | BF mention | multi-ecosystem | 237s | 11,333 | $0.73 | pass |
| Sonnet 4.6 | BF mention | python-ai | 38s | 1,682 | $0.18 | pass |
| Sonnet 4.6 | Prompt | heavy-ts | 767s | 52,926 | $3.18 | build failed |
| Sonnet 4.6 | Prompt | light-ts | 520s | 32,767 | $1.88 | pass |
| Sonnet 4.6 | Prompt | multi-ecosystem | 347s | 22,642 | $1.20 | pass |
| Sonnet 4.6 | Prompt | python-ai | 226s | 16,418 | $0.85 | pass |
* Failed on a since-identified template generator bug on our side (excluded from agent pass
rates). The Sonnet 4.6 prompt-path heavy-ts failure is agent-authored and counts.
GPT sweep (Codex CLI, June 10 — pre-fix generator)
| Model | Path | Spec | Time | Output tokens | Result |
|---|---|---|---|---|---|
| GPT-5.3 Codex Spark | MCP | heavy-ts | 26s | 5,300 | build failed* |
| GPT-5.3 Codex Spark | MCP | light-ts | 18s | 2,870 | pass |
| GPT-5.3 Codex Spark | MCP | multi-ecosystem | 66s | 10,988 | pass |
| GPT-5.3 Codex Spark | MCP | python-ai | 19s | 4,035 | pass |
| GPT-5.3 Codex Spark | BF mention | heavy-ts | 52s | 10,742 | build failed* |
| GPT-5.3 Codex Spark | BF mention | light-ts | 11s | 1,221 | pass |
| GPT-5.3 Codex Spark | BF mention | multi-ecosystem | 147s | 18,676 | pass |
| GPT-5.3 Codex Spark | BF mention | python-ai | 53s | 8,936 | pass |
| GPT-5.3 Codex Spark | Prompt | heavy-ts | 68s | 54,325 | build failed |
| GPT-5.3 Codex Spark | Prompt | light-ts | 51s | 27,446 | build failed |
| GPT-5.3 Codex Spark | Prompt | multi-ecosystem | 38s | 21,353 | pass |
| GPT-5.3 Codex Spark | Prompt | python-ai | 23s | 22,549 | pass |
| GPT-5.4 | MCP | heavy-ts | 51s | 3,016 | build failed* |
| GPT-5.4 | MCP | light-ts | 44s | 2,153 | pass |
| GPT-5.4 | MCP | multi-ecosystem | 217s | 13,235 | pass |
| GPT-5.4 | MCP | python-ai | 55s | 2,284 | pass |
| GPT-5.4 | BF mention | heavy-ts | 322s | 13,834 | build failed* |
| GPT-5.4 | BF mention | light-ts | 30s | 753 | pass |
| GPT-5.4 | BF mention | multi-ecosystem | 144s | 7,870 | pass |
| GPT-5.4 | BF mention | python-ai | 128s | 5,881 | pass |
| GPT-5.4 | Prompt | heavy-ts | 236s | 15,502 | build failed |
| GPT-5.4 | Prompt | light-ts | 251s | 15,271 | pass |
| GPT-5.4 | Prompt | multi-ecosystem | 170s | 11,795 | pass |
| GPT-5.4 | Prompt | python-ai | 155s | 10,745 | pass |
| GPT-5.5 | MCP | heavy-ts | 108s | 6,704 | build failed* |
| GPT-5.5 | MCP | light-ts | 58s | 2,544 | pass |
| GPT-5.5 | MCP | multi-ecosystem | 97s | 4,101 | build failed* |
| GPT-5.5 | MCP | python-ai | 43s | 2,003 | pass |
| GPT-5.5 | BF mention | heavy-ts | 120s | 6,851 | build failed* |
| GPT-5.5 | BF mention | light-ts | 26s | 1,513 | pass |
| GPT-5.5 | BF mention | multi-ecosystem | 84s | 5,480 | pass |
| GPT-5.5 | BF mention | python-ai | 66s | 4,347 | pass |
| GPT-5.5 | Prompt | heavy-ts | 542s | 28,342 | pass |
| GPT-5.5 | Prompt | light-ts | 163s | 10,601 | pass |
| GPT-5.5 | Prompt | multi-ecosystem | 242s | 15,787 | pass |
| GPT-5.5 | Prompt | python-ai | 110s | 7,869 | pass |
* Failed on a since-fixed template generator bug on our side (excluded from agent pass rates) — this sweep ran before the fixes shipped. GPT prompt-path failures are agent-authored and count.
Light sweep (Gemini CLI / Kilo / opencode, June 12 — light-ts only)
| Model | Agent CLI | Path | Time | Output tokens | Result |
|---|---|---|---|---|---|
| Gemini 3.1 Pro | Gemini CLI | MCP | 60.2s | 1,584 | pass |
| Gemini 3.1 Pro | Gemini CLI | BF mention | 26.7s | 863 | pass |
| Gemini 3.1 Pro | Gemini CLI | Prompt | 123.6s | 13,188 | pass** |
| Gemini 3.1 Pro (repeat) | Gemini CLI | MCP | 69.6s | 1,280 | pass |
| Gemini 3.1 Pro (repeat) | Gemini CLI | BF mention | 50.5s | 873 | pass |
| Gemini 3.1 Pro (repeat) | Gemini CLI | Prompt | 254.8s | 15,752 | pass |
| Laguna m.1 | Kilo | MCP | 224.8s | 1,560 | pass |
| Laguna m.1 | Kilo | BF mention | 592.5s | 3,653 | pass |
| Laguna m.1 | Kilo | Prompt | 900.0s† | 9,439 | pass |
| Nex N2-Pro | Kilo | MCP | 226.6s | 2,362 | pass |
| Nex N2-Pro | Kilo | BF mention | 146.2s | 1,294 | pass |
| Nex N2-Pro | Kilo | Prompt | 426.4s | 7,806 | fail (build) |
| Step-3.7 Flash | Kilo | MCP | 45.5s | 1,985 | pass |
| Step-3.7 Flash | Kilo | BF mention | 15.6s | 640 | pass |
| Step-3.7 Flash | Kilo | Prompt | 166.0s | 12,738 | fail (install) |
| DeepSeek-V4 Pro | opencode | MCP | 42.9s | 1,369 | pass |
| DeepSeek-V4 Pro | opencode | BF mention | 40.0s | 1,391 | pass |
| DeepSeek-V4 Pro | opencode | Prompt | 357.4s | 11,740 | pass |
| GLM-5.1 | opencode | MCP | 45.9s | 1,172 | pass |
| GLM-5.1 | opencode | BF mention | 67.4s | 2,197 | pass |
| GLM-5.1 | opencode | Prompt | 693.9s | 29,527 | pass |
| Kimi K2.6 | opencode | MCP | 311.2s | 2,816 | pass |
| Kimi K2.6 | opencode | BF mention | 24.3s | 1,618 | pass |
| Kimi K2.6 | opencode | Prompt | 371.9s | 13,527 | pass |
| MiniMax M3 | opencode | MCP | 68.7s | 2,473 | pass |
| MiniMax M3 | opencode | BF mention | 31.9s | 846 | pass |
| MiniMax M3 | opencode | Prompt | 712.9s | 36,034 | fail (build) |
| Qwen3.7 Max | opencode | MCP | 80.2s | 2,669 | pass |
| Qwen3.7 Max | opencode | BF mention | 35.2s | 843 | pass |
| Qwen3.7 Max | opencode | Prompt | 378.9s | 13,313 | pass |
Notes:
- † Laguna m.1 prompt-only hit the 900-second budget, but the project written before the cutoff installs and builds.
- ** Gemini 3.1 Pro prompt-only passed install + build; it failed the project's own Biome lint script, which the headline criterion doesn't count.
We also attempted Nemotron-3 Ultra 550B on Kilo's free tier, but its runs are excluded from the results: the endpoint was effectively unusable for agentic work in our harness (minutes per turn, zero or one tool execution per 15-minute budget on every path). That's an infrastructure failure — we couldn't meaningfully reach the model — not a benchmark result about the model.
Want to see the fast path yourself? Point your agent at the Better-Fullstack MCP server and ask it to scaffold something heavy.