ScaffBench: measuring coding agents on real fullstack scaffolding

102 runs across Claude Code, Codex CLI, Gemini CLI, Kilo, and opencode, measuring speed, cost, tokens, and whether generated projects install and build.

June 12, 2026 · Ibrahim Elkamali

benchmark · mcp · claude-code · agents

Coding agents are very good at writing code and surprisingly bad at starting projects. Ask one to scaffold a production-grade fullstack monorepo and it will happily spend ten minutes hand-writing manifests, lockfiles, and config — and the result frequently doesn't build.

ScaffBench measures exactly that: an agent in an empty workspace, a project spec, and a hard question at the end — does the generated project install and build? We ran the same specs through three creation paths to isolate how much tooling helps:

MCP — the agent uses the Better-Fullstack MCP server to plan and scaffold the project.
BF mention — no MCP; the prompt points the agent at the Better-Fullstack CLI and docs, and it composes the create command itself.
Prompt — no Better-Fullstack at all; the agent hand-writes every file from scratch.

We ran the full suite twice, on two different agent CLIs: OpenAI Codex CLI with three GPT models (June 10), and Claude Code with three Claude models (June 12) — 72 runs in total. Same specs, same prompts, same validation. We then added a deliberately light third sweep — nine more models across Gemini CLI, Kilo, and opencode, on the lightest spec only — bringing the total to 102 runs and fifteen models.

Headline results

The headline numbers come from the Claude Code sweep — 36 runs: three models (Fable 5, Opus 4.8, Sonnet 4.6) × three creation paths × four project specs. Same machine, same harness, same prompts apart from the creation-mode instructions. The GPT sweep shows the same structure and is reported separately below.

Creation path	Avg time	Median time	Avg output tokens	Builds passing
MCP	113.4s	85.9s	5,553	12/12 (100%)
BF mention	219.6s	172.1s	11,060	9/9 (100%)*
Prompt	516.2s	424.0s	25,859	9/12 (75%)

* Three BF-mention runs are excluded from the denominator because they failed on a template generator bug on our side, not on anything the agent did; see "What counts as a failure" below.

Three things stood out:

MCP is ~4.6× faster than prompt-only on average (and ~4.9× on medians), with ~4.7× fewer output tokens. The agent ships configuration, not code.
Every MCP run passed validation, for every model. Prompt-only runs passed 75% — two runs hit the 15-minute timeout and one produced a monorepo that didn't build.
Tooling compresses model differences. On the MCP path even the smallest model matched the frontier models at 100% pass rate — it just got there faster and cheaper.
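The ratios quoted above follow directly from the headline table. A minimal sketch of the arithmetic, with the figures copied from the MCP and Prompt rows (the dictionary layout and function name are illustrative, not part of the harness):

```python
# Figures copied from the headline table (Claude Code sweep).
headline = {
    "MCP": {"avg_s": 113.4, "median_s": 85.9, "avg_out_tokens": 5_553},
    "Prompt": {"avg_s": 516.2, "median_s": 424.0, "avg_out_tokens": 25_859},
}


def prompt_over_mcp(metric: str) -> float:
    """How many times larger the prompt-only figure is than the MCP figure."""
    return headline["Prompt"][metric] / headline["MCP"][metric]


speedup_avg = round(prompt_over_mcp("avg_s"), 1)
speedup_median = round(prompt_over_mcp("median_s"), 1)
token_ratio = round(prompt_over_mcp("avg_out_tokens"), 1)

print(speedup_avg, speedup_median, token_ratio)  # 4.6 4.9 4.7
```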

Methodology

The three creation paths

Every run starts in an empty directory with the same base prompt:

You are running in an empty benchmark workspace:
{run_dir}

Create exactly one project directory named `{project_name}`.
Do not ask questions. Do not start a dev server. Do not write outside the
current working directory.
At the end, report the commands you ran and any errors you hit.

Benchmark target: {spec_title}
Requirements:
{spec_requirements}

The only difference between paths is the creation-mode instruction appended to that prompt. Prompt-only runs are explicitly forbidden from touching anything of ours:

Creation mode: prompt-only.
Do not use the Better-Fullstack MCP server, Better-Fullstack CLI,
Better-Fullstack website, or files from the Better-Fullstack repository.
Create the project from scratch by writing the files and manifests needed
for a runnable starter.

MCP runs get the opposite:

Creation mode: Better-Fullstack MCP.
Use the Better-Fullstack MCP tools, starting with bfs_get_guidance. Then use
schema/compatibility/plan as needed and call bfs_create_project to create
the project.

And BF-mention runs sit in between — no MCP tools, but the prompt names the CLI and a README the agent may read before composing a non-interactive create command itself.
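To make the "same base prompt, different suffix" design concrete, here is a rough sketch of how such prompts could be assembled. The template and the two mode instructions are the ones quoted above; the function name, the dictionary keys, and the example values are hypothetical, and the BF-mention instruction is left out because its wording is not quoted in this post.

```python
BASE_PROMPT = """\
You are running in an empty benchmark workspace:
{run_dir}

Create exactly one project directory named `{project_name}`.
Do not ask questions. Do not start a dev server. Do not write outside the
current working directory.
At the end, report the commands you ran and any errors you hit.

Benchmark target: {spec_title}
Requirements:
{spec_requirements}
"""

# Creation-mode instructions as quoted in the post. BF mention is omitted:
# its exact wording is not given, so it is not reconstructed here.
MODE_INSTRUCTIONS = {
    "prompt": (
        "Creation mode: prompt-only.\n"
        "Do not use the Better-Fullstack MCP server, Better-Fullstack CLI,\n"
        "Better-Fullstack website, or files from the Better-Fullstack repository.\n"
        "Create the project from scratch by writing the files and manifests needed\n"
        "for a runnable starter.\n"
    ),
    "mcp": (
        "Creation mode: Better-Fullstack MCP.\n"
        "Use the Better-Fullstack MCP tools, starting with bfs_get_guidance. Then use\n"
        "schema/compatibility/plan as needed and call bfs_create_project to create\n"
        "the project.\n"
    ),
}


def build_prompt(mode: str, **spec_fields: str) -> str:
    """Fill the shared base prompt, then append the creation-mode instruction."""
    return BASE_PROMPT.format(**spec_fields) + "\n" + MODE_INSTRUCTIONS[mode]


# Hypothetical example values; the requirements string is abridged.
example_fields = {
    "run_dir": "/tmp/scaffbench/run-001",
    "project_name": "light-ts-app",
    "spec_title": "light-ts",
    "spec_requirements": "React + Vite, Hono on Bun, tRPC, SQLite + Drizzle",
}
mcp_prompt = build_prompt("mcp", **example_fields)
prompt_only = build_prompt("prompt", **example_fields)
```

Because the base text is shared, any difference between two runs of the same spec comes from the appended instruction alone.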

The four project specs

The specs are deliberately spread from "weekend project" to "the stack a real team would argue about for a week":

Spec	What it asks for
`light-ts`	React + Vite, Hono on Bun, tRPC, SQLite + Drizzle, Tailwind + DaisyUI, Pino, Vitest, Biome — no auth, no payments
`heavy-ts`	Next.js and an Expo/React Native app, Hono on Bun, oRPC, PostgreSQL + Drizzle, Better Auth, Stripe, Resend, UploadThing, Effect, TanStack Store/Form/Query, Valibot, Vitest + Playwright, Vercel AI SDK, Socket.IO, Inngest, Framer Motion, OpenTelemetry, PostHog, Umami, Sanity, Upstash Redis, Algolia, S3, shadcn/ui, Turborepo, MSW, Storybook, PWA, Docker compose
`python-ai`	FastAPI, SQLModel, PostgreSQL, Pydantic, JWT auth, Celery, Strawberry GraphQL, LangChain + OpenAI SDK + LangGraph + CrewAI, Ruff
`multi-ecosystem`	TypeScript Next.js frontend, Python FastAPI backend (SQLModel/Pydantic/LangChain/Celery), Go Gin service (GORM, gRPC, Cobra, Zap), shared PostgreSQL

heavy-ts is the stress test. It is also where every interesting failure in this benchmark happened.

The harness

Two sweeps, one harness design:

Claude sweep: Claude Code in headless mode (claude -p), JSON event stream captured per run. Models: claude-fable-5, claude-opus-4-8, claude-sonnet-4-6, default reasoning effort.
GPT sweep: OpenAI Codex CLI (codex exec), JSON event stream captured per run. Models: gpt-5.3-codex-spark (high effort), gpt-5.4 (medium), gpt-5.5 (medium).
Isolation: every run gets a fresh empty workspace; MCP runs get a strict MCP config with only the Better-Fullstack server attached.
Timeout: 900 seconds per run. A timeout counts as a failure for the agent.
Metrics: wall-clock time, token usage from the agent's own usage report, MCP tool calls from the event stream. Dollar cost is captured on the Claude sweep only — our Codex harness doesn't report metered cost.
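The isolation and timeout rules above can be sketched in a few lines. This is not the actual harness: it assumes the agent CLI accepts the prompt as its final argument, and the `RunResult` type and `run_agent` name are made up for illustration. Only the 900-second budget and the fresh-workspace rule come from the post.

```python
import subprocess
import tempfile
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Sequence

TIMEOUT_S = 900  # per-run budget stated in the post


@dataclass
class RunResult:
    workspace: Path          # kept on disk so validation can inspect it afterwards
    wall_clock_s: float
    timed_out: bool
    exit_code: Optional[int]
    stdout: str              # e.g. the captured event stream


def run_agent(command: Sequence[str], prompt: str,
              timeout_s: float = TIMEOUT_S) -> RunResult:
    """Run one agent command in a fresh empty directory under a time budget."""
    workspace = Path(tempfile.mkdtemp(prefix="scaffbench-"))
    started = time.monotonic()
    try:
        # Assumption: the prompt is passed as the last command-line argument.
        proc = subprocess.run(
            [*command, prompt],
            cwd=workspace,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        timed_out, exit_code, stdout = False, proc.returncode, proc.stdout
    except subprocess.TimeoutExpired as exc:
        # The child is killed on timeout; keep whatever output was captured.
        partial = exc.stdout or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        timed_out, exit_code, stdout = True, None, partial
    elapsed = time.monotonic() - started
    return RunResult(workspace, elapsed, timed_out, exit_code, stdout)
```

Returning the workspace path rather than deleting it matters because validation runs afterwards against whatever the agent left on disk, including after a timeout.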

One important asymmetry: the GPT sweep ran two days earlier, against the generator before we fixed the template bugs it exposed. That's why the generator-bug exclusions below hit the GPT results harder — and why the two sweeps shouldn't be compared head-to-head on pass rates.

What counts as passing

Validation runs after the agent finishes, against whatever it left on disk:

TypeScript: bun install, then the project's own build (or check/lint/test) script.
Python: compileall across the project, excluding virtualenvs and vendored packages.
Go: go mod download and the module must compile.

A run passes only if every applicable check exits zero. "It looks like a project" doesn't count; it has to build.
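A minimal sketch of that "every applicable check exits zero" rule follows. The detection heuristics (looking for `package.json`, `*.py`, and `go.mod`), the skipped directory names, the use of `bun run build` as the build script, and `go build ./...` as the compile step are all assumptions made for illustration; the post only names the checks, not how they are wired.

```python
import subprocess
import sys
from pathlib import Path

# Directory names treated as virtualenv or vendored content (an assumption).
SKIP_DIRS = {".venv", "venv", "site-packages", "node_modules"}
SKIP_REGEX = r"[/\\](\.venv|venv|site-packages|node_modules)[/\\]"


def _is_project_file(path: Path, project: Path) -> bool:
    """True when the path does not live under a skipped directory."""
    return not SKIP_DIRS.intersection(path.relative_to(project).parts)


def plan_checks(project: Path) -> list:
    """Return (working directory, command) pairs for each ecosystem present."""
    checks = []
    if (project / "package.json").exists():
        checks.append((project, ["bun", "install"]))
        checks.append((project, ["bun", "run", "build"]))
    if any(_is_project_file(p, project) for p in project.rglob("*.py")):
        # compileall's -x flag skips files whose path matches the regex.
        checks.append((project, [sys.executable, "-m", "compileall",
                                 "-q", "-x", SKIP_REGEX, "."]))
    for go_mod in sorted(project.rglob("go.mod")):
        if _is_project_file(go_mod, project):
            checks.append((go_mod.parent, ["go", "mod", "download"]))
            checks.append((go_mod.parent, ["go", "build", "./..."]))
    return checks


def passes(project: Path) -> bool:
    """A run passes only if every applicable check exits zero."""
    checks = plan_checks(project)
    if not checks:
        return False  # assumption: an empty workspace is a failure
    for cwd, command in checks:
        try:
            result = subprocess.run(command, cwd=cwd, capture_output=True)
        except FileNotFoundError:
            return False  # the required toolchain is not installed
        if result.returncode != 0:
            return False
    return True
```

Stopping at the first non-zero exit keeps failed runs cheap to validate; a real harness would likely also record which check failed.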

What counts as a failure

Failures caused by a bug in our own template generator, not by anything the agent did, are excluded from the pass-rate denominator; failures the agent authored count in full. On the Claude sweep this exclusion applied exactly three times: all three models, on the BF-mention path, scaffolded heavy-ts correctly via the CLI, and all three hit the same generator bug in our native (Expo) template, where app.json sets a web output mode that only works with expo-router. The agents did everything right; the template couldn't build. Those three runs are excluded. Prompt-path failures (two timeouts, one broken hand-written build) are entirely agent-authored and count in full.

On the GPT sweep — which ran against the pre-fix generator — it mattered seven times: heavy-ts through MCP and through the CLI failed for all three models on the since-fixed Storybook/Expo bug chain, plus one multi-ecosystem scaffold whose Go service shipped without a generated go.sum. All seven are excluded for the same reason; the GPT agents' own failures (three broken hand-written builds on the prompt path) count in full.
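The exclusion rule changes the denominator, not the numerator. A small sketch, using the Claude sweep counts from the headline table; the `Run` type and its field names are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    passed: bool
    generator_bug: bool = False  # failure traced to the template generator


def pass_rate(runs):
    """Return (passed, counted), dropping generator-bug runs from the denominator."""
    counted = [r for r in runs if not r.generator_bug]
    return sum(r.passed for r in counted), len(counted)


# Claude sweep, BF mention: 12 runs, 3 failed on the generator bug.
claude_bf_mention = [Run(True)] * 9 + [Run(False, generator_bug=True)] * 3
# Claude sweep, prompt-only: 12 runs, 3 agent-authored failures.
claude_prompt = [Run(True)] * 9 + [Run(False)] * 3

print(pass_rate(claude_bf_mention))  # (9, 9)
print(pass_rate(claude_prompt))      # (9, 12)
```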

Results

Per model and path

Times and tokens are averages across the four specs; cost is the metered API total for those four runs.

Model	Path	Avg time	Avg output tokens	Cost (4 specs)	Builds passing
Fable 5	MCP	172.6s	7,590	$9.34	4/4
Fable 5	BF mention	405.7s	17,748	$15.66	3/3*
Fable 5	Prompt	572.8s	24,905	$11.26†	3/4
Opus 4.8	MCP	97.1s	5,206	$3.43	4/4
Opus 4.8	BF mention	154.7s	10,596	$4.12	3/3*
Opus 4.8	Prompt	510.8s	21,485	$6.89†	3/4
Sonnet 4.6	MCP	70.3s	3,863	$1.47	4/4
Sonnet 4.6	BF mention	98.3s	4,834	$1.31	3/3*
Sonnet 4.6	Prompt	464.9s	31,188	$7.11	3/4

* heavy-ts excluded (generator bug on our side). † Timed-out runs report no cost, so these totals under-count the true spend.
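As a worked example of how one table row relates to the appendix, here is the Fable 5 prompt-path row recomputed from its four runs. The tuple layout is illustrative. The token average is an inference: the reported 24,905 is what you get if the timed-out run contributes zero tokens to a four-run average.

```python
from statistics import mean

# (spec, seconds, output tokens, cost in USD) from the appendix.
# None marks values the timed-out run did not report.
fable5_prompt = [
    ("heavy-ts", 900, None, None),
    ("light-ts", 413, 32_080, 3.50),
    ("multi-ecosystem", 570, 37_436, 4.21),
    ("python-ai", 408, 30_103, 3.55),
]

# The timed-out run counts at the full 900 s in the time average.
avg_time = mean(seconds for _, seconds, _, _ in fable5_prompt)

# Tokens and cost are summed over the runs that reported them.
avg_tokens = sum(t for _, _, t, _ in fable5_prompt if t is not None) / len(fable5_prompt)
reported_cost = round(sum(c for _, _, _, c in fable5_prompt if c is not None), 2)
unreported = [spec for spec, _, _, c in fable5_prompt if c is None]

print(avg_time, round(avg_tokens), reported_cost, unreported)
# 572.75 24905 11.26 ['heavy-ts']
```

The $11.26 therefore covers three runs, not four, which is why the dagger marks it as an under-count.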

Cost

The Claude sweep cost $60.60 in metered API usage (less than the true number — the two timed-out runs report $0; the Codex harness doesn't report cost at all). Reading the spread is more interesting than the total:

Cheapest passing run: $0.12 — Sonnet 4.6 scaffolding light-ts through the CLI in 21 seconds.
Most expensive passing runs: $3.50–$4.21, all Fable 5: three hand-written projects on the prompt path and one BF-mention light-ts run ($3.74).
For Sonnet, the prompt path cost ~5× its MCP path ($7.11 vs $1.47) and was ~6.6× slower, for a lower pass rate.

The pattern holds across all three models: the more of the project the agent has to author itself, the more you pay for a less reliable result.

GPT models on Codex CLI

The GPT sweep ran the identical specs and prompts through OpenAI's Codex CLI two days before the Claude sweep: gpt-5.3-codex-spark at high reasoning effort, gpt-5.4 and gpt-5.5 at medium. Generator-bug failures are excluded exactly as above (seven runs here — this sweep predates the template fixes).

Model	Path	Avg time	Avg output tokens	Builds passing
GPT-5.3 Codex Spark	MCP	32.4s	5,798	3/3*
GPT-5.3 Codex Spark	BF mention	65.6s	9,894	3/3*
GPT-5.3 Codex Spark	Prompt	44.8s	31,418	2/4
GPT-5.4	MCP	92.0s	5,172	3/3*
GPT-5.4	BF mention	156.0s	7,084	3/3*
GPT-5.4	Prompt	203.1s	13,328	3/4
GPT-5.5	MCP	76.5s	3,838	2/2*
GPT-5.5	BF mention	74.1s	4,548	3/3*
GPT-5.5	Prompt	264.2s	15,650	4/4

* heavy-ts excluded on MCP and BF mention for all three models, plus one GPT-5.5 multi-ecosystem MCP run — all on since-fixed generator bugs (this sweep ran before the fixes shipped).

What the GPT sweep adds to the picture:

The structure is vendor-independent. For every GPT model, prompt-only is the most expensive path in output tokens, and MCP is the fastest path for two of the three (on GPT-5.5, BF mention edged it out, 74.1s vs 76.5s). That is the same shape as the Claude results, on a different vendor's models and a different agent CLI.
Spark is built for speed, and it shows. GPT-5.3 Codex Spark averaged 32 seconds per MCP scaffold, the fastest per-model average in either full sweep, but on the prompt path its speed came at the price of reliability: it hand-wrote projects in 45 seconds on average and only half of them built.
GPT-5.5 is the prompt-path outlier. It was the only model across both sweeps to pass every prompt-only spec, including heavy-ts — at 542 seconds and 28k output tokens for that one run, versus 108 seconds through MCP.
GPT models lean harder on the tools. Codex runs made 10–34 MCP calls per scaffold versus Claude's 3–10, re-checking compatibility and re-planning more often — with the same end result.

A caveat worth repeating: the two sweeps ran on different agent harnesses, different days, and different generator versions. Comparing paths within a sweep is the experiment; comparing vendors across sweeps is not.

The light sweep: Gemini, Kilo, and opencode

The full suite costs real money and real rate-limit budget, so for the next batch of models we ran a deliberately light version: one spec (light-ts) × three creation paths × one run per cell. Nine models across three more agent CLIs, all on June 12:

Gemini CLI — Gemini 3.1 Pro (Google's latest, and the CLI's default).
opencode (Go subscription) — Kimi K2.6, GLM-5.1, MiniMax M3, Qwen3.7 Max, DeepSeek-V4 Pro: the current open-weights flagship tier.
Kilo CLI (free model tier) — Step-3.7 Flash, Laguna m.1, Nex N2-Pro.

Same prompts, same 900-second budget, same validation as the main sweeps. Scoring matches the headline criterion — does the project install and build — so lint-only failures don't count against a run (the strict logs are in the run artifacts). Gemini got a second, accidental repeat of all three paths, which we kept as a free variance sample: 30 runs total. (A tenth model, Nemotron-3 Ultra on Kilo's free tier, is excluded — the endpoint was too slow to drive an agent at all; see the appendix note.)

Model	MCP	BF mention	Prompt-only
Gemini 3.1 Pro	✅ 60s · 1.6k	✅ 27s · 0.9k	✅ 124s · 13.2k
Kimi K2.6	✅ 311s · 2.8k	✅ 24s · 1.6k	✅ 372s · 13.5k
GLM-5.1	✅ 46s · 1.2k	✅ 67s · 2.2k	✅ 694s · 29.5k
MiniMax M3	✅ 69s · 2.5k	✅ 32s · 0.8k	❌ build break
Qwen3.7 Max	✅ 80s · 2.7k	✅ 35s · 0.8k	✅ 379s · 13.3k
DeepSeek-V4 Pro	✅ 43s · 1.4k	✅ 40s · 1.4k	✅ 357s · 11.7k
Step-3.7 Flash	✅ 45s · 2.0k	✅ 16s · 0.6k	❌ broken install
Laguna m.1	✅ 225s · 1.6k	✅ 592s · 3.7k	✅ 900s (timeout) · 9.4k
Nex N2-Pro	✅ 227s · 2.4k	✅ 146s · 1.3k	❌ build break

What the light sweep adds:

The assisted paths went 18-for-18. Every model passed both MCP and BF mention. That includes Step-3.7 Flash — a small, fast, free model — scaffolding a correct fullstack monorepo in 16 seconds and 640 output tokens via the BF mention path. The structure we saw on Claude and GPT isn't a frontier-model phenomenon; the tooling levels the playing field all the way down.
Prompt-only is where the field splits. Three of nine models shipped broken projects, each in an instructive way: Step-3.7 Flash pinned pino-http to a version that doesn't exist (install failed); MiniMax M3 named a tRPC procedure useContext, colliding with tRPC's built-in client method (type check failed); Nex N2-Pro wrote an import that doesn't resolve (Rollup failed). None of these failure modes exist on the assisted paths, because the agent never hand-writes a manifest.
The economics hold on a third, fourth, and fifth harness. Every assisted run finished under 4k output tokens, and all but one under 3k. Every prompt-only run cost 8–36k; MiniMax M3's 36k was the most token-hungry run of the light sweep, for a project that didn't build.
Gemini 3.1 Pro is excellent at this task. A fast BF mention cell (26.7 seconds, 863 tokens), a textbook four-call MCP flow, and the quickest prompt-only pass of the light sweep (124 seconds). Its repeat sample is also a useful honesty check: the first prompt run failed Biome lint, the second passed cleanly. Single-run cells are directional, not definitive.

Light-sweep caveats: one run per cell, and 25 of the runs executed concurrently, so wall-clock times carry contention noise (Gemini's BF mention run measured 27s solo and 50s in the parallel batch — same outcome, different traffic). Treat path structure and pass/fail as the signal, exact seconds as approximate.

Qualitative analysis

MCP keeps agents on rails

MCP runs converged on the same tool sequence: bfs_get_guidance → compatibility check → plan → bfs_create_project, between 3 and 10 calls per run on the Claude sweep (the multi-ecosystem spec needed the most iterations to settle a valid stack). Output stays small because the agent's job collapses to choosing a configuration; the file-writing is done by the generator, deterministically. That's also why pass rates don't degrade as the spec grows: on the Claude sweep, which ran against the post-fix generator, heavy-ts through MCP passed for every model.

Prompt-only agents drown in the heavy spec

On the prompt path, heavy-ts defeated almost everyone. Fable 5 and Opus 4.8 were still writing files at the 15-minute timeout. Sonnet 4.6 finished a 90-file monorepo in 767 seconds, and it didn't build. GPT-5.3 Codex Spark sprinted to a hand-written project in 68 seconds; it didn't build either. The single exception across both sweeps was GPT-5.5, which ground through heavy-ts prompt-only in 542 seconds and passed. The lighter specs mostly passed prompt-only, but typically at minutes of wall-clock and roughly 8k–37k output tokens each. Hand-writing a starter is something frontier models can do; it's just the slowest, most expensive, least reliable way to get one.

The benchmark audited our own templates

The most useful failures were ours. Validating every generated project surfaced six layered template bugs in our generator, most hiding behind a single "Storybook build fails" symptom on heavy-ts. The GPT sweep hit them first — eight of its nine heavy-ts builds failed, six of them on this generator chain — and the Claude sweep two days later confirmed which fixes held:

Storybook templates tested the frontend with an equality check, but frontend is an array.
A database package path mismatch left the DB package without its dependencies in graph-part mode.
expo-network was missing from a native template variant.
Storybook 8 framework packages don't re-export Meta/StoryObj types — imports had to move to the renderer packages.
@better-auth/expo needed four Expo peer dependencies installed explicitly.
Native app.json sets a static web output mode that only works with expo-router — this one is still open, and is the bug behind the three excluded BF-mention runs.

Five of the six were fixed and shipped in create-better-fullstack 2.0.2 before this post went out. Running your own product through an agent benchmark turns out to be a brutally effective QA pass.
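The first bug in that list is a common class of mistake and easy to reproduce. The sketch below is a Python analogue, not the template code: the config shape and the values `"next"` and `"native"` are hypothetical stand-ins, since the post only says that `frontend` is an array and the template compared it with an equality check.

```python
# Hypothetical config shape: the post only states that `frontend` is an array.
config = {"frontend": ["next", "native"]}


def uses_next_buggy(cfg: dict) -> bool:
    """Equality against a single value: always False when frontend is a list."""
    return cfg["frontend"] == "next"


def uses_next_fixed(cfg: dict) -> bool:
    """Membership test: True when the list contains the value."""
    return "next" in cfg["frontend"]


print(uses_next_buggy(config))  # False
print(uses_next_fixed(config))  # True
```

A check like the buggy one fails silently, so the wrong template branch renders and the problem only surfaces later as a build error.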

Model character

Sonnet 4.6 was the fastest on every path and the cheapest on both assisted paths, and with tooling it gave up nothing in reliability. (On the prompt path its $7.11 sits just above Opus 4.8's $6.89, a total that omits a timed-out run.) Fable 5 was the most deliberate: highest token counts, longest runs, and on the prompt path that thoroughness still wasn't enough to beat the timeout on heavy-ts. Opus 4.8 sat reliably in between. The speed ranking never changed across paths; the gap did. Tooling is the great equalizer: on MCP, the spread between the best and worst model averages was 102 seconds; on the prompt path, runs took minutes and, twice, hit the wall-clock ceiling.

Limitations

One run per cell. 102 runs is enough to see the structure, not to put confidence intervals on it. Scaffold times also vary with API load — and the light sweep's parallel batch adds contention noise on top (Gemini's own repeat showed both a 2× time spread and a lint-fail/pass flip between identical runs).
The sweeps aren't head-to-head. The GPT runs used a different agent CLI (Codex), different reasoning-effort settings, an earlier generator version, and report no dollar cost; the light sweep covers one spec on three further CLIs. Within-sweep path comparisons are sound; cross-vendor model rankings are not what this benchmark measures.
We benchmarked our own tool. The harness, prompts, and validation are public in spirit — prompts and policy are quoted above — and the full per-run table is in the appendix. Treat the comparison between paths (tooling vs no tooling) as the finding, not a claim about other scaffolders.
Builds, not features. Validation proves the project installs and builds — not that the auth flow works or the Stripe webhook is wired correctly. A passing prompt-only project may still be missing more of the spec than a generated one.
Timeout cost under-reporting. Runs killed at 900s report $0 cost, which flatters the prompt path's totals.
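To put a rough number on the "one run per cell" limitation, here is a standard Wilson score interval applied to the prompt-only pass rate of 9/12 from the headline table. This is an illustration, not an analysis from the benchmark: it treats the twelve runs as independent trials with a common pass probability, which they are not, since they span different models and specs.

```python
from math import sqrt


def wilson_interval(passed: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half_width, centre + half_width


low, high = wilson_interval(9, 12)
print(f"{low:.2f} to {high:.2f}")  # roughly 0.47 to 0.91
```

Even under that generous assumption, a 75% observed pass rate is compatible with anything from about half to about nine in ten, which is why the path structure is a safer reading than the exact percentages.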

Appendix: all 102 runs

Claude sweep (Claude Code, June 12)

Model	Path	Spec	Time	Output tokens	Cost	Result
Fable 5	MCP	heavy-ts	290s	11,891	$2.97	pass
Fable 5	MCP	light-ts	82s	4,166	$1.62	pass
Fable 5	MCP	multi-ecosystem	240s	10,895	$3.22	pass
Fable 5	MCP	python-ai	78s	3,406	$1.54	pass
Fable 5	BF mention	heavy-ts	613s	30,259	$6.98	build failed*
Fable 5	BF mention	light-ts	314s	16,132	$3.74	pass
Fable 5	BF mention	multi-ecosystem	470s	15,196	$2.98	pass
Fable 5	BF mention	python-ai	226s	9,404	$1.95	pass
Fable 5	Prompt	heavy-ts	900s	—	—	timed out
Fable 5	Prompt	light-ts	413s	32,080	$3.50	pass
Fable 5	Prompt	multi-ecosystem	570s	37,436	$4.21	pass
Fable 5	Prompt	python-ai	408s	30,103	$3.55	pass
Opus 4.8	MCP	heavy-ts	178s	5,805	$0.88	pass
Opus 4.8	MCP	light-ts	46s	2,894	$0.72	pass
Opus 4.8	MCP	multi-ecosystem	118s	8,903	$1.10	pass
Opus 4.8	MCP	python-ai	47s	3,223	$0.74	pass
Opus 4.8	BF mention	heavy-ts	345s	24,383	$2.29	build failed*
Opus 4.8	BF mention	light-ts	39s	2,754	$0.32	pass
Opus 4.8	BF mention	multi-ecosystem	116s	8,038	$0.70	pass
Opus 4.8	BF mention	python-ai	118s	7,211	$0.81	pass
Opus 4.8	Prompt	heavy-ts	900s	—	—	timed out
Opus 4.8	Prompt	light-ts	435s	33,390	$2.50	pass
Opus 4.8	Prompt	multi-ecosystem	395s	28,598	$2.43	pass
Opus 4.8	Prompt	python-ai	313s	23,952	$1.96	pass
Sonnet 4.6	MCP	heavy-ts	90s	5,094	$0.41	pass
Sonnet 4.6	MCP	light-ts	47s	2,530	$0.33	pass
Sonnet 4.6	MCP	multi-ecosystem	101s	5,904	$0.42	pass
Sonnet 4.6	MCP	python-ai	43s	1,923	$0.31	pass
Sonnet 4.6	BF mention	heavy-ts	98s	5,527	$0.28	build failed*
Sonnet 4.6	BF mention	light-ts	21s	795	$0.12	pass
Sonnet 4.6	BF mention	multi-ecosystem	237s	11,333	$0.73	pass
Sonnet 4.6	BF mention	python-ai	38s	1,682	$0.18	pass
Sonnet 4.6	Prompt	heavy-ts	767s	52,926	$3.18	build failed
Sonnet 4.6	Prompt	light-ts	520s	32,767	$1.88	pass
Sonnet 4.6	Prompt	multi-ecosystem	347s	22,642	$1.20	pass
Sonnet 4.6	Prompt	python-ai	226s	16,418	$0.85	pass

* Failed on a since-identified template generator bug on our side (excluded from agent pass rates). The Sonnet 4.6 prompt-path heavy-ts failure is agent-authored and counts.

GPT sweep (Codex CLI, June 10 — pre-fix generator)

Model	Path	Spec	Time	Output tokens	Result
GPT-5.3 Codex Spark	MCP	heavy-ts	26s	5,300	build failed*
GPT-5.3 Codex Spark	MCP	light-ts	18s	2,870	pass
GPT-5.3 Codex Spark	MCP	multi-ecosystem	66s	10,988	pass
GPT-5.3 Codex Spark	MCP	python-ai	19s	4,035	pass
GPT-5.3 Codex Spark	BF mention	heavy-ts	52s	10,742	build failed*
GPT-5.3 Codex Spark	BF mention	light-ts	11s	1,221	pass
GPT-5.3 Codex Spark	BF mention	multi-ecosystem	147s	18,676	pass
GPT-5.3 Codex Spark	BF mention	python-ai	53s	8,936	pass
GPT-5.3 Codex Spark	Prompt	heavy-ts	68s	54,325	build failed
GPT-5.3 Codex Spark	Prompt	light-ts	51s	27,446	build failed
GPT-5.3 Codex Spark	Prompt	multi-ecosystem	38s	21,353	pass
GPT-5.3 Codex Spark	Prompt	python-ai	23s	22,549	pass
GPT-5.4	MCP	heavy-ts	51s	3,016	build failed*
GPT-5.4	MCP	light-ts	44s	2,153	pass
GPT-5.4	MCP	multi-ecosystem	217s	13,235	pass
GPT-5.4	MCP	python-ai	55s	2,284	pass
GPT-5.4	BF mention	heavy-ts	322s	13,834	build failed*
GPT-5.4	BF mention	light-ts	30s	753	pass
GPT-5.4	BF mention	multi-ecosystem	144s	7,870	pass
GPT-5.4	BF mention	python-ai	128s	5,881	pass
GPT-5.4	Prompt	heavy-ts	236s	15,502	build failed
GPT-5.4	Prompt	light-ts	251s	15,271	pass
GPT-5.4	Prompt	multi-ecosystem	170s	11,795	pass
GPT-5.4	Prompt	python-ai	155s	10,745	pass
GPT-5.5	MCP	heavy-ts	108s	6,704	build failed*
GPT-5.5	MCP	light-ts	58s	2,544	pass
GPT-5.5	MCP	multi-ecosystem	97s	4,101	build failed*
GPT-5.5	MCP	python-ai	43s	2,003	pass
GPT-5.5	BF mention	heavy-ts	120s	6,851	build failed*
GPT-5.5	BF mention	light-ts	26s	1,513	pass
GPT-5.5	BF mention	multi-ecosystem	84s	5,480	pass
GPT-5.5	BF mention	python-ai	66s	4,347	pass
GPT-5.5	Prompt	heavy-ts	542s	28,342	pass
GPT-5.5	Prompt	light-ts	163s	10,601	pass
GPT-5.5	Prompt	multi-ecosystem	242s	15,787	pass
GPT-5.5	Prompt	python-ai	110s	7,869	pass

* Failed on a since-fixed template generator bug on our side (excluded from agent pass rates) — this sweep ran before the fixes shipped. GPT prompt-path failures are agent-authored and count.

Light sweep (Gemini CLI / Kilo / opencode, June 12 — light-ts only)

Model	Agent CLI	Path	Time	Output tokens	Result
Gemini 3.1 Pro	Gemini CLI	MCP	60.2s	1,584	pass
Gemini 3.1 Pro	Gemini CLI	BF mention	26.7s	863	pass
Gemini 3.1 Pro	Gemini CLI	Prompt	123.6s	13,188	pass**
Gemini 3.1 Pro (repeat)	Gemini CLI	MCP	69.6s	1,280	pass
Gemini 3.1 Pro (repeat)	Gemini CLI	BF mention	50.5s	873	pass
Gemini 3.1 Pro (repeat)	Gemini CLI	Prompt	254.8s	15,752	pass
Laguna m.1	Kilo	MCP	224.8s	1,560	pass
Laguna m.1	Kilo	BF mention	592.5s	3,653	pass
Laguna m.1	Kilo	Prompt	900.0s†	9,439	pass
Nex N2-Pro	Kilo	MCP	226.6s	2,362	pass
Nex N2-Pro	Kilo	BF mention	146.2s	1,294	pass
Nex N2-Pro	Kilo	Prompt	426.4s	7,806	fail (build)
Step-3.7 Flash	Kilo	MCP	45.5s	1,985	pass
Step-3.7 Flash	Kilo	BF mention	15.6s	640	pass
Step-3.7 Flash	Kilo	Prompt	166.0s	12,738	fail (install)
DeepSeek-V4 Pro	opencode	MCP	42.9s	1,369	pass
DeepSeek-V4 Pro	opencode	BF mention	40.0s	1,391	pass
DeepSeek-V4 Pro	opencode	Prompt	357.4s	11,740	pass
GLM-5.1	opencode	MCP	45.9s	1,172	pass
GLM-5.1	opencode	BF mention	67.4s	2,197	pass
GLM-5.1	opencode	Prompt	693.9s	29,527	pass
Kimi K2.6	opencode	MCP	311.2s	2,816	pass
Kimi K2.6	opencode	BF mention	24.3s	1,618	pass
Kimi K2.6	opencode	Prompt	371.9s	13,527	pass
MiniMax M3	opencode	MCP	68.7s	2,473	pass
MiniMax M3	opencode	BF mention	31.9s	846	pass
MiniMax M3	opencode	Prompt	712.9s	36,034	fail (build)
Qwen3.7 Max	opencode	MCP	80.2s	2,669	pass
Qwen3.7 Max	opencode	BF mention	35.2s	843	pass
Qwen3.7 Max	opencode	Prompt	378.9s	13,313	pass

Notes:

† Laguna m.1 prompt-only hit the 900-second budget, but the project written before the cutoff installs and builds.
** The first Gemini 3.1 Pro prompt-only run passed install + build; it failed the project's own Biome lint script, which the headline criterion doesn't count.

We also attempted Nemotron-3 Ultra 550B on Kilo's free tier, but its runs are excluded from the results: the endpoint was effectively unusable for agentic work in our harness (minutes per turn, zero or one tool execution per 15-minute budget on every path). That's an infrastructure failure — we couldn't meaningfully reach the model — not a benchmark result about the model.

Want to see the fast path yourself? Point your agent at the Better-Fullstack MCP server and ask it to scaffold something heavy.